Abstract:
In Amharic writing, there are characters with the same sound but different shapes, called
homophone characters. The current trend in Amharic NLP tasks is to normalize those
homophones into a single representation. This means instead of ሀ(hā), ሃ(ha), ሐ(ḥā), ሓ(ḥa),
ኀ(ḫā), ኃ(ḫā), and ኻ(ẖa), the symbol ሀ(hā) should be used; instead of አ(ā), ኣ(a), ዐ(‘ā), and
ዓ(‘a), the symbol አ(ā) should be used; and so on. Normalization is performed under the
assumption that homophones share the same sound and are therefore redundant letters.
However, the impact of homophone normalization on semantic models and NLP
applications is not well explored. Homophone normalization can change the meaning of a
word, neglect the standard writing system, and pose new challenges to NLP tools. For
example, the word ድህነት (dihinneti) means “poverty,” while ድኅነት (diḫineti) means
“salvation.” When homophone normalization is applied, ድኅነት (diḫineti, “salvation”) is
collapsed into the dominant variant ድህነት (dihinneti, “poverty”), and its distinct meaning is
lost.
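As a minimal sketch (not the exact implementation used in this work), such normalization
can be expressed as a character-substitution table; the hypothetical mapping below covers
only the two homophone series listed above, plus the sixth-order forms needed for this
example:

```python
# Illustrative sketch of Amharic homophone normalization as a
# character-substitution table (hypothetical fragment; a full table
# covers every vowel order of each homophone series).
HOMOPHONE_MAP = {
    # first-order ha-series variants collapsed to ሀ
    "ሃ": "ሀ", "ሐ": "ሀ", "ሓ": "ሀ", "ኀ": "ሀ", "ኃ": "ሀ", "ኻ": "ሀ",
    # sixth-order ha-series variants collapsed to ህ
    "ሕ": "ህ", "ኅ": "ህ", "ኽ": "ህ",
    # first-order a-series variants collapsed to አ
    "ኣ": "አ", "ዐ": "አ", "ዓ": "አ",
}

def normalize(text: str) -> str:
    """Replace each homophone variant with its dominant character."""
    return "".join(HOMOPHONE_MAP.get(ch, ch) for ch in text)

# The distinction discussed above disappears after normalization:
# both "poverty" and "salvation" map to the same surface form.
assert normalize("ድኅነት") == normalize("ድህነት") == "ድህነት"
```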
To study the impact of homophone normalization, we develop pre-trained embedding
models for Amharic from both regular and normalized text. The embedding models we
build include Word2Vec, fastText, RoBERTa, and FLAIR. Moreover, we investigate the
impact of normalization on several core Amharic NLP tasks. For part-of-speech (PoS)
tagging, a model that employs regular FLAIR embeddings performs better than the
normalized model, achieving an F1-score of 77.16%. For named entity recognition (NER),
a model built from normalized Word2Vec (CBOW) embeddings performs better, with an
F1-score of 74.48%. For sentiment analysis, the model built from regular RoBERTa
performs best, with an F1-score of 59.81%. For information retrieval (IR), we achieve an
F1-score of 89.73% using normalized text. These results show that the benefit of
normalization is highly dependent on the NLP application; most Amharic NLP models
developed from regular embeddings outperform models developed from normalized
embeddings. The main contributions of this work are the finding that normalization is
task-dependent, state-of-the-art performance on the tasks considered, and the publication
of pre-trained embedding models for Amharic.
Keywords: homophone normalization, semantic models, pre-trained embedding models,
NLP applications, Fidäl script, text normalization