Abstract:
Word sense disambiguation (WSD) plays an important role in many NLP applications
such as information extraction, information retrieval, machine translation, and
lexicography. Manual disambiguation by humans is tedious, error-prone, expensive,
and time-consuming. Recent research on Amharic WSD has mostly used handcrafted
rules. Such approaches cannot automatically learn different representations of the
target (ambiguous) word from data. Moreover, they consider only a limited window of
words surrounding the target word in the sentence. The main drawback of previous
work is that the sense of a word cannot be detected from the synset list unless the
word is explicitly mentioned there. Our study explores and designs an Amharic word
sense disambiguation model that employs transformer-based contextual embeddings.
More specifically, we exploit the different capabilities provided by a transformer
model, namely AmRoBERTa.
As there is no standard sense-tagged Amharic text dataset for the Amharic WSD task, we
first compile 800 ambiguous words from different sources, including an Amharic
dictionary, Amharic textbooks (Grades 7-12), and the Abissinica online dictionary.
Furthermore, we collect more than 33k sentences that contain these ambiguous words.
The 33k sentences are used to fine-tune our transformer-based RoBERTa model,
AmRoBERTa (a minimal fine-tuning sketch is given below).
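The sketch below shows how such continued masked-language-model fine-tuning could be run with the Hugging Face transformers library. It is a minimal illustration, not our exact training script; the checkpoint identifier uhhlt/am-roberta, the file name, and the hyperparameters are assumptions.

```python
# Minimal sketch: continued masked-LM fine-tuning on the 33k sentences.
# The checkpoint id and file name below are assumptions, not confirmed here.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "uhhlt/am-roberta"  # assumed Hugging Face id for AmRoBERTa
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Hypothetical plain-text file: one ambiguous-word sentence per line.
dataset = load_dataset("text", data_files={"train": "amharic_wsd_33k.txt"})
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"])

# Randomly mask 15% of tokens, the standard RoBERTa pre-training objective.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="amroberta-wsd", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```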
We conduct two types of annotation for our WSD experiments. First, with the help of
linguistic experts, we annotate 10k sentences for seven types of word relations
(synonymy, hyponymy, hypernymy, meronymy, holonymy, toponymy, and homonymy).
For the WSD experiment, we then choose 10 target words and annotate a total of 1,000
sentences with their correct sense using the WebAnno annotation tool. Each sentence,
containing one target ambiguous word, is annotated by two annotators and one curator
(adjudicator). As preparing glosses for each sense is time-consuming, we prepare 100
glosses for the 10 selected target words.
We conduct two main experiments: word-relation classification using CNN, Bi-LSTM,
and BERT models, and WSD using the AmRoBERTa model with sentence similarity
measures. For the classification task, the CNN, Bi-LSTM, and BERT-based models
achieve accuracies of 90%, 88%, and 93%, respectively (a minimal classifier sketch
follows).
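As a rough illustration of the BERT-based relation classifier, the following sketch fine-tunes an encoder for seven-way sentence classification. The checkpoint id, file names, and CSV schema (a "text" column and an integer "label" column) are illustrative assumptions.

```python
# Minimal sketch: seven-way word-relation classification with a BERT-style
# encoder. File names and the CSV schema are hypothetical.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

LABELS = ["synonymy", "hyponymy", "hypernymy", "meronymy",
          "holonymy", "toponymy", "homonymy"]

model_name = "uhhlt/am-roberta"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=len(LABELS))

# Hypothetical CSV files with columns "text" and "label" (0..6).
data = load_dataset("csv", data_files={"train": "relations_train.csv",
                                       "test": "relations_test.csv"})
encoded = data.map(
    lambda batch: tokenizer(batch["text"], truncation=True,
                            padding="max_length", max_length=128),
    batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="relation-clf", num_train_epochs=4,
                           per_device_train_batch_size=16),
    train_dataset=encoded["train"],
    eval_dataset=encoded["test"],
)
trainer.train()
print(trainer.evaluate())  # reports evaluation loss; add metrics as needed
```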
For the WSD task, we employ the FLAIR document embedding framework to embed the
target sentences and the glosses separately. We then compute the similarity between
the embedding of the target sentence and the embedding of each gloss; the gloss with
the highest score determines the sense of the target sentence (see the sketch below).
Our model achieves an F1 score of 71%.
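A minimal sketch of this gloss-selection step follows, assuming FLAIR's TransformerDocumentEmbeddings wraps the fine-tuned AmRoBERTa checkpoint (the checkpoint id is again an assumption). Each key of the glosses mapping is a sense label and each value is its gloss text.

```python
# Minimal sketch of gloss selection: embed the target sentence and each
# candidate gloss, then return the sense whose gloss has the highest
# cosine similarity. The embedding checkpoint id is an assumption.
import torch
from flair.data import Sentence
from flair.embeddings import TransformerDocumentEmbeddings

embedder = TransformerDocumentEmbeddings("uhhlt/am-roberta")  # assumed id

def embed(text: str) -> torch.Tensor:
    """Return a single document-level vector for the given text."""
    sentence = Sentence(text)
    embedder.embed(sentence)
    return sentence.embedding

def disambiguate(target_sentence: str, glosses: dict) -> str:
    """Pick the sense whose gloss is most similar to the target sentence.

    `glosses` maps each sense label to its gloss text.
    """
    sentence_vec = embed(target_sentence)
    scores = {
        sense: torch.cosine_similarity(sentence_vec, embed(gloss), dim=0).item()
        for sense, gloss in glosses.items()
    }
    return max(scores, key=scores.get)
```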
Due to time constraints and the lack of an Amharic WordNet, we could not experiment
with larger training datasets. In the future, we plan to compile glosses for at least the
1,000 sentences annotated with WebAnno and report the resulting performance.
Keywords: Word Sense Disambiguation, Transfer Learning, Neural Network, Pre-trained
Language Model, Natural Language Processing, Morphological Analyzer, Amharic
WSD.