Abstract:
Requirements gathering and design are key phases of the software
development lifecycle. One task in requirements gathering is identifying use
cases and actors. A use case is a specification of a set of actions performed by a
system that yields an observable result of value to one or more actors or other
stakeholders in the system. The main problem in identifying use cases, however, is
that clients often use ambiguous language when speaking with software analysts, and a
literal interpretation of it can lead to a misunderstanding of the software's functional
requirements. In the existing literature, use case identification has mostly been
studied with natural language processing (NLP), identifying use cases and actors
through case studies, heuristic rules, and checklist methods, which are costly in time
and resources; these studies also did not cluster the types of relationships between
use cases in the requirements text.
To address these gaps, we aim to identify use cases and actors and to cluster the
relationships between use cases in the requirements text using different machine
learning approaches. To perform this study, we used an experimental research
approach. We applied the machine learning classifiers support vector machine (SVM),
naive Bayes (NB), logistic regression (LR), and random forest (RF) to build the
model. For the experiment, we prepared a dataset of 1884 requirement texts,
which were labeled by experts at Boost Software Development PLC for actor and use
case identification. For clustering relationships, we used an unlabeled dataset of
1600 records, to which we applied the unsupervised clustering algorithms DBSCAN,
K-means, and hierarchical clustering. The datasets fed to our proposed model were
pre-processed using NLP techniques, and features were extracted with TF-IDF and
word2vec. Based on
our experiments, for use case identification, logistic regression, random forest,
and SVM achieved the best accuracy of 98%, whereas NB achieved 95%. For actor
identification, SVM, RF, and LR achieved the best accuracy of 99%, whereas NB
achieved 97%. For relationship-type clustering, K-means achieved the best
silhouette score of 0.76, outperforming DBSCAN and hierarchical clustering, which we
attribute to K-means being better suited to our small dataset.
Keywords: requirements text, actor, use cases, relationship types, machine learning,
natural language processing
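
The silhouette score used above to compare clustering algorithms can be illustrated
with a minimal sketch. It assumes Euclidean distance and at least two clusters; the
function name, the toy points, and the labels are hypothetical and are not taken from
the study's data or implementation.

```python
import math
from collections import defaultdict


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points, using Euclidean distance."""
    # Group point indices by their cluster label.
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette score needs at least two clusters")

    def mean_dist(i, members):
        # Average distance from point i to the members, excluding i itself.
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # convention: a singleton cluster contributes 0
        a = mean_dist(i, own)  # cohesion: mean distance within own cluster
        b = min(  # separation: mean distance to the nearest other cluster
            mean_dist(i, members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        if denom > 0:  # identical points everywhere would give 0/0; treat as 0
            total += (b - a) / denom
    return total / len(points)


# Hypothetical toy data: two tight, well-separated groups.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
good_labels = [0, 0, 1, 1]
bad_labels = [0, 1, 0, 1]
print(silhouette_score(toy_points, good_labels))
print(silhouette_score(toy_points, bad_labels))
```

Scores close to 1 indicate compact, well-separated clusters, while scores near or
below 0 indicate overlapping or misassigned clusters, so the first labeling above
scores higher than the second.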