LEXICAL COMPLEXITY DETECTION AND SIMPLIFICATION IN AMHARIC TEXT USING MACHINE LEARNING APPROACH

GEBREGZIABIHIER, NIGUSIE BIRHANE

URI: http://ir.bdu.edu.et/handle/123456789/14435

Date: 2022-08

Abstract:

Text complexity is the degree of difficulty a document poses for its target readers. One common type is lexical complexity, which causes comprehension problems for second-language learners, low-literacy readers, and children, and which is also challenging for NLP applications. Amharic contains many complex and unfamiliar words that lead low-literacy readers to misunderstand documents and that hinder NLP applications. To reduce this type of complexity for Amharic, a low-resourced and morphologically rich language, we designed a lexical complexity detection and simplification model using a machine learning approach. We developed three successive models. The first classifies the lexical complexity of Amharic text and is trained on 19k sentences; to embed these sentences, we built a Word2Vec embedding model with a vocabulary of 9,756 words. The second model detects specific complex terms and is developed using a vocabulary of 1,002 words. Lastly, we built Word2Vec (CBOW) and RoBERTa models using 57k sentences for simplification generation and ranking. In the experiments, the Amharic text complexity classification models achieved accuracies of 85% (SVM), 81.5% (RF), 86% (LSTM), 88% (BiLSTM), and 91% (BERT). BERT has the best classification accuracy because of its ability to handle long-term information dependencies. For specific complex term detection and simplification generation, Word2Vec gives better similarity results: its top-ranked simple terms score 87%, 92%, 67%, 84%, and 53% for five complex test sentences, whereas RoBERTa predicts less well, with scores of 54%, 17%, 0.9%, 6%, and 8% for the same five sentences. Due to time and resource constraints, we used a limited number of complex terms, and the RoBERTa model is not well trained for masked-word prediction. For future research, we therefore recommend increasing the amount of complex training data to improve model performance and addressing the syntactic complexity of Amharic text.

Keywords: Text complexity, Complexity detection, Supervised classification, Lexical complexity, Lexical simplification
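
The simplification generation and ranking step described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: it assumes candidates are ranked by cosine similarity between word vectors and that known complex terms are excluded from the candidates. The function names (`cosine`, `rank_substitutes`), the placeholder tokens, and the 3-dimensional toy vectors are all hypothetical and stand in for a trained Word2Vec model and real Amharic vocabulary.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_substitutes(target, embeddings, complex_terms, top_k=3):
    """Rank simpler candidate words by similarity to a complex target word.

    embeddings: dict mapping word -> vector (assumed to come from a trained
    embedding model; here it is a hand-written toy dictionary).
    complex_terms: set of words treated as complex, excluded from candidates.
    """
    target_vec = embeddings[target]
    scored = [
        (word, cosine(target_vec, vec))
        for word, vec in embeddings.items()
        if word != target and word not in complex_terms
    ]
    # Highest similarity first
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy data: placeholder tokens, not real Amharic words or vectors
toy_embeddings = {
    "complex_term": [0.9, 0.1, 0.0],
    "simple_near": [0.8, 0.2, 0.0],
    "simple_far": [0.0, 0.1, 0.9],
    "other_complex": [0.9, 0.1, 0.1],
}
toy_complex_terms = {"complex_term", "other_complex"}

ranking = rank_substitutes("complex_term", toy_embeddings, toy_complex_terms)
for word, score in ranking:
    print(f"{word}\t{score:.3f}")
```

In this sketch the candidate closest to the complex term in vector space is proposed first, and words already flagged as complex are never offered as replacements.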
