Abstract:
Idiomatic expressions are a natural part of all languages and a common part of everyday conversation. The meaning of an idiom is difficult to understand because it cannot be deduced directly from the words that make it up. The existence of idioms has influenced Natural Language Processing (NLP) research: idioms have been shown to affect NLP tasks such as machine translation, semantic analysis, sentiment analysis, information retrieval, question answering, and next-word prediction. For other languages, such as English, Chinese, Japanese, and Indian languages, idioms have been identified with different methods in different studies, but for the Amharic language there is no research on idiom identification. Since there is no standard model for identifying Amharic idioms, this study aimed to develop an idiom identification model for the Amharic language using a supervised machine learning approach. A dataset of one thousand expressions was collected from the Amharic idiom book “የአማረኛ ፈሊጦች” and from various Amharic documents. The expressions were converted to vector representations in Python to prepare a dataset compatible with the identification model. As a further contribution, we digitized the hard-copy Amharic idiom book so that other interested parties can use it as a lookup table in their own NLP tasks. The model helps NLP researchers decide whether a phrase is idiomatic or literal. Using the k-nearest neighbors (KNN) algorithm, the developed model achieved 97.5% accuracy on the test dataset.
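The pipeline summarized above (vectorize each expression, then label it with KNN) can be illustrated with a minimal sketch. It is not the study's implementation: the abstract does not state which vector representation or distance measure was used, so the character-bigram counts and cosine similarity here are assumptions, as are the names `char_ngrams`, `cosine`, `knn_predict`, and `TOY_TRAIN`. The English toy phrases stand in for the Amharic data and are not drawn from the study's dataset.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=2):
    """Count overlapping character n-grams in a phrase (assumed feature set)."""
    padded = f" {text.strip().lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def knn_predict(train, phrase, k=3):
    """Label a phrase by majority vote among its k most similar training phrases."""
    query = char_ngrams(phrase)
    ranked = sorted(
        train,
        key=lambda item: cosine(query, char_ngrams(item[0])),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy examples, for illustration only.
TOY_TRAIN = [
    ("kick the bucket", "idiom"),
    ("spill the beans", "idiom"),
    ("break the ice", "idiom"),
    ("kick the ball", "literal"),
    ("spill the water", "literal"),
    ("break the window", "literal"),
]

if __name__ == "__main__":
    print(knn_predict(TOY_TRAIN, "kick the bucket", k=1))
```

Character n-grams are chosen here only because they need no tokenizer or external library; a real system for Amharic would more likely rely on word-level or learned embeddings and a held-out test split to measure accuracy.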