Abstract:
In Amharic writing, there are characters with the same sound but different shapes, called
homophone characters. The current trend in Amharic NLP tasks is to normalize those
homophones into a single representation. This means instead of ሀ(hā), ሃ(ha), ሐ(ḥā), ሓ(ḥa),
ኀ(ḫā), ኃ(ḫā), and ኻ(ẖa), the symbol ሀ(hā) should be used; instead of አ(ā), ኣ(a), ዐ(‘ā), and
ዓ(‘a), the symbol አ(ā) should be used; and so on. Normalization is performed under the
assumption that homophones share the same sound and are therefore redundant letters.
However, the impact of homophone normalization on semantic models and NLP
applications is not well explored. Homophone normalization can change the meaning of a
word, neglect the standard writing system, and pose new challenges to NLP tools. For
example, the word ድህነት (dihinneti) means “poverty,” while ድኅነት (diḫineti) means
“salvation.” When homophone normalization is applied, ድኅነት (diḫineti, “salvation”) is
collapsed into the dominant variant ድህነት (dihinneti, “poverty”), and its distinct meaning is
lost.
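As a minimal sketch (not the exact implementation used in this work), such normalization
can be expressed as a character-substitution table; the hypothetical mapping below covers
only the two homophone series listed above, plus the sixth-order forms needed for this
example:

```python
# Illustrative sketch of Amharic homophone normalization as a
# character-substitution table (hypothetical fragment; a full table
# covers every vowel order of each homophone series).
HOMOPHONE_MAP = {
    # first-order ha-series variants collapsed to ሀ
    "ሃ": "ሀ", "ሐ": "ሀ", "ሓ": "ሀ", "ኀ": "ሀ", "ኃ": "ሀ", "ኻ": "ሀ",
    # sixth-order ha-series variants collapsed to ህ
    "ሕ": "ህ", "ኅ": "ህ", "ኽ": "ህ",
    # first-order a-series variants collapsed to አ
    "ኣ": "አ", "ዐ": "አ", "ዓ": "አ",
}

def normalize(text: str) -> str:
    """Replace each homophone variant with its dominant character."""
    return "".join(HOMOPHONE_MAP.get(ch, ch) for ch in text)

# The distinction discussed above disappears after normalization:
# both "poverty" and "salvation" map to the same surface form.
assert normalize("ድኅነት") == normalize("ድህነት") == "ድህነት"
```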
To study the impact of homophone normalization, we develop pre-trained embedding
models for Amharic from both regular and normalized text. The embedding models we
build include Word2Vec, fastText, RoBERTa, and FLAIR. Moreover, we investigate the
impact of normalization on several core Amharic NLP tasks. For part-of-speech (PoS)
tagging, a model that employs regular FLAIR embeddings performs better than the
normalized model, achieving an F1-score of 77.16%. For named entity recognition (NER),
a model built from normalized Word2Vec (CBOW) embeddings performs better, with an
F1-score of 74.48%. For sentiment analysis, the model built from regular RoBERTa
performs best, with an F1-score of 59.81%. For information retrieval (IR), we achieve an
F1-score of 89.73% using normalized text. These results show that the benefit of
normalization is highly dependent on the NLP application; most Amharic NLP models
developed from regular embeddings outperform models developed from normalized
embeddings. The main contributions of this work are the finding that normalization is
task-dependent, state-of-the-art performance on the tasks considered, and the publication
of pre-trained embedding models for Amharic.
Keywords: homophone normalization, semantic models, pre-trained embedding models,
NLP applications, Fidäl script, text normalization