Abstract:
Part-of-speech (POS) tagging is the task of assigning word-class information to each word in naturally
occurring text. It is one of the most fundamental tasks in almost all natural language processing (NLP). POS tagging has been done for many languages, including Amharic, Afaan Oromo, and Tigrigna, to make them
easier to process by computer; as a result, it serves as a foundation for developing parsers, machine
translation, and speech recognition systems for local languages. However, existing POS taggers cannot be
directly applied to the Goofatho language, because Goofatho differs from other languages in its
morphology, language features, word formation, and semantics. The Goofatho language is a member of the
Omotic language family, belonging to the North Ometo Cluster, and is spoken by the people of Gofa as well
as the different communities living in the Gofa zone in SNNPR. The language has a
strong influence on the socio-cultural and linguistic identity of the people, and developing a POS
tagger for Goofatho helps make the language suitable for NLP. No NLP work, notably on POS tagging,
has previously been done for the Goofatho language. Therefore, this research contributes to filling that gap and laying
the foundation for developing higher-level NLP applications for Goofatho. To develop the POS tagger,
this study employs a hybrid model combining hidden Markov model (HMM), regexp, and rule-based taggers.
The literature on Goofatho grammar and morphology is reviewed in order to better grasp the language's
nature and to identify a suitable tag set. No pre-built corpus exists for the Goofatho language. Thus, with the
help of linguistic experts, a tag set of 24 tags was developed and 10,307 words (928 sentences) were manually
tagged. The corpus was then partitioned into training and testing sets: 90% of the tagged
corpus was used to train both the HMM and rule-based taggers, yielding probabilities and transformation rules
respectively, and the remainder was held out for testing. In the proposed hybrid tagger, a regexp tagger handles certain suffixes and unknown words, and
an N-gram tagger serves as the initial tagger for the rule-based tagger, with the regexp tagger as its back-off. The HMM tagger
estimates probabilities from the corpus and decodes the most likely tag sequence with the Viterbi algorithm, while the rule-based tagger learns a series of
transformation rules from the corpus using the Brill tagger. As a result, the hybrid tagger (which combines
the HMM, regexp, and rule-based taggers) assigns the most accurate word-class information to raw Goofatho text.
Several experiments were conducted to evaluate the performance of the HMM, rule-based, and
hybrid taggers, which achieve 65.1%, 83%, and
84.2%, respectively. The hybrid tagger thus outperforms both the HMM tagger and the
rule-based tagger applied separately. This suggests that a hybrid method is an effective way of
developing POS taggers for low-resourced languages.
Keywords: NLP, POS, HMM, rule-based, hybrid approach
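
The back-off arrangement described in the abstract can be illustrated with a minimal sketch in plain Python. It covers only two of the pieces: a lookup tagger trained on 90% of a tagged corpus, and a regexp tagger that handles unknown words as the back-off. The Brill transformation rules and the HMM are omitted. The suffix patterns, tag names, and function names below are hypothetical placeholders, not the study's 24-tag set or real Goofatho morphology.

```python
import re
from collections import Counter, defaultdict

# Hypothetical patterns and tag names, for illustration only;
# they do not reflect the study's tag set or Goofatho suffixes.
SUFFIX_PATTERNS = [
    (re.compile(r"^\d+$"), "NUM"),  # tokens made only of digits
    (re.compile(r".*ing$"), "V"),   # placeholder suffix rule
    (re.compile(r".*"), "N"),       # default tag for anything else
]


def regexp_tag(word):
    """Return the tag of the first pattern that matches the word."""
    for pattern, tag in SUFFIX_PATTERNS:
        if pattern.match(word):
            return tag
    return None


def train_unigram(tagged_sentences):
    """Map each training word to its most frequent tag."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_sentence(words, unigram_model):
    """Tag known words from the model; back off to the regexp tagger otherwise."""
    return [(w, unigram_model.get(w) or regexp_tag(w)) for w in words]


def split_corpus(tagged_sentences, train_fraction=0.9):
    """Split a corpus into training and testing portions, in order."""
    cut = int(len(tagged_sentences) * train_fraction)
    return tagged_sentences[:cut], tagged_sentences[cut:]


def accuracy(tagged_sentences, unigram_model):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for sentence in tagged_sentences:
        predicted = tag_sentence([w for w, _ in sentence], unigram_model)
        for (_, gold), (_, pred) in zip(sentence, predicted):
            correct += gold == pred
            total += 1
    return correct / total if total else 0.0
```

In a fuller pipeline, the output of `tag_sentence` would be the initial tagging that transformation rules then correct. Ordering the default pattern last means every unknown word receives some tag.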