Abstract:
Text complexity is the degree of difficulty a document poses for its target readers. One
common type is lexical complexity, which can cause comprehension problems for
second-language learners, low-literacy readers, and children. It is also challenging for
NLP applications. Amharic
text contains such complex and unfamiliar words, which lead low-literacy readers to
misunderstand documents and pose challenges for NLP applications. To reduce this type of
text complexity for Amharic, a low-resource and morphologically rich language, we
have designed a lexical complexity detection and simplification model using a machine
learning approach. We develop three successive models. The first model classifies the
lexical complexity of Amharic text and is trained on 19k sentences. To embed these
sentences, we built a Word2Vec embedding model with a vocabulary of 9,756 words. The
second model, developed using 1,002 vocabulary items, detects specific complex terms.
Lastly, we built Word2Vec (CBOW) and RoBERTa models using 57k sentences for
simplification generation and ranking. In the experiments, the Amharic text complexity
classification models achieve accuracies of 85% (SVM), 81.5% (RF), 86% (LSTM),
88% (BiLSTM), and 91% (BERT). The BERT model therefore has the best classification
accuracy, which we attribute to its ability to handle long-term dependencies. For
complex term detection and simplification generation, Word2Vec yields better similarity
results: it scores 87%, 92%, 67%, 84%, and 53% for the top-ranked simple terms on five
complex test sentences. RoBERTa, by contrast, predicts less well, scoring 54%, 17%, 0.9%,
6%, and 8% on the same five sentences.
Because of time and resource constraints, we used a limited number of complex terms, and
the RoBERTa model was not trained well for masked-word prediction. For future research,
we therefore recommend increasing the amount of complex training data to improve model
performance and addressing the syntactic complexity of Amharic text.
Keywords: Text complexity, Complexity detection, Supervised classification, Lexical
complexity, Lexical simplification
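
The similarity-based ranking of simple substitutes summarized in the abstract can be illustrated with a minimal sketch. It assumes that candidate substitutes are ranked by cosine similarity between embedding vectors; the abstract does not state the exact scoring used, so this is an illustration, not the authors' implementation. The function names, the placeholder tokens, and the three-dimensional toy vectors are all hypothetical. In practice the vectors would come from a trained Word2Vec (CBOW) model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_substitutes(target, candidates, vectors):
    """Return (candidate, score) pairs sorted from most to least similar to the target."""
    target_vec = vectors[target]
    scored = [
        (cand, cosine(target_vec, vectors[cand]))
        for cand in candidates
        if cand in vectors  # skip out-of-vocabulary candidates
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy embeddings standing in for vectors from a trained model.
toy_vectors = {
    "complex_term": [0.9, 0.1, 0.3],
    "simple_a": [0.8, 0.2, 0.3],
    "simple_b": [0.1, 0.9, 0.2],
    "simple_c": [0.5, 0.5, 0.5],
}

ranking = rank_substitutes(
    "complex_term", ["simple_a", "simple_b", "simple_c"], toy_vectors
)
for word, score in ranking:
    print(f"{word}\t{score:.3f}")
```

Under this assumption, the top-ranked candidate would be proposed as the simpler replacement for the detected complex term.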