DESIGN AND DEVELOP PART OF SPEECH TAGGING FOR GOOFATHO LANGUAGE USING HYBRID APPROACH (HMM AND RULE BASED)

MULUKEN, TOMAS

URI: http://ir.bdu.edu.et/handle/123456789/14396

Date: 2022-03

Abstract:

Part-of-speech (POS) tagging is the task of assigning word-class information to each word in naturally occurring text. It is one of the most fundamental tasks in natural language processing (NLP). POS taggers have been developed for many languages, including Amharic, Afaan Oromo, and Tigrigna, to support computer processing of those languages; they serve as a foundation for developing parsers, machine translation, and speech recognition systems for local languages. However, existing POS taggers cannot be applied directly to the Goofatho language, because Goofatho differs from these languages in its morphology, word formation, and semantics. Goofatho is a member of the Omotic language family, belonging to the North Ometo cluster, and is spoken by the people of Gofa as well as by the different communities living in the Gofa zone in SNNPR. The language has a strong influence on the socio-cultural and linguistic identity of its speakers, and developing a POS tagger makes the language more amenable to NLP. No prior NLP work has been done for Goofatho, notably on POS tagging. This research therefore contributes to filling that gap and lays the base for developing higher-level NLP applications for Goofatho. To develop the POS tagger, this study employs a hybrid model that combines HMM, regexp, and rule-based taggers. The literature on Goofatho grammar and morphology was reviewed in order to better grasp the nature of the language and to identify possible tags. There is no ready-made corpus for Goofatho. Thus, with the help of linguistic experts, a tag set of 24 tags was developed and 10,307 words (928 sentences) were manually tagged. The corpus was then partitioned into two parts for training and testing: 90% of the tagged corpus was used to train both the HMM and the rule-based tagger, yielding probabilities and transformation rules respectively, with the remaining 10% held out for testing.
In the proposed hybrid tagger, a regexp tagger handles some suffixes and unknown words, and an N-gram tagger, with the regexp tagger as its backoff, serves as the initial tagger for the rule-based tagger. The HMM tagger computes tag-sequence probabilities using the Viterbi algorithm, while the rule-based tagger learns a series of transformation rules from the corpus using the Brill tagger. As a result, the hybrid tagger (which combines the HMM, regexp, and rule-based taggers) provides the best word-class information for raw Goofatho text. Several experiments were run to evaluate the HMM, rule-based, and hybrid taggers, which achieve 65.1%, 83%, and 84.2% performance, respectively. The hybrid tagger therefore outperforms both the HMM tagger and the rule-based tagger applied separately. This shows that the hybrid method is an effective way to develop POS taggers for low-resource languages.
Keywords: NLP, POS, HMM, Rule based, Hybrid approach
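
To illustrate the HMM component mentioned in the abstract, the following is a minimal sketch of Viterbi decoding over counts taken from a tagged corpus. It is not the thesis's implementation: the three-sentence corpus uses English placeholder words and a made-up three-tag set (DET, N, V), not Goofatho data or the 24-tag set, and the add-one smoothing and the function names train_hmm, log_prob, and viterbi are assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus: English placeholders, NOT the Goofatho corpus.
TRAIN = [
    [("the", "DET"), ("dog", "N"), ("runs", "V")],
    [("the", "DET"), ("cat", "N"), ("sleeps", "V")],
    [("a", "DET"), ("dog", "N"), ("sleeps", "V")],
]

def train_hmm(sentences):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    trans = defaultdict(Counter)  # trans[prev_tag][tag]
    emit = defaultdict(Counter)   # emit[tag][word]
    vocab = set()
    for sent in sentences:
        prev = "<s>"  # sentence-start pseudo-tag
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
    return trans, emit, sorted(emit), vocab

def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log-probability (an assumed smoothing choice)."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))

def viterbi(words, model):
    """Return the most probable tag sequence for `words` under the model."""
    trans, emit, tags, vocab = model
    if not words:
        return []
    n_tags = len(tags)
    n_words = len(vocab) + 1  # one extra outcome reserved for unknown words
    # best[i][t]: log-probability of the best path ending in tag t at position i
    best = [{t: log_prob(trans["<s>"], t, n_tags)
                + log_prob(emit[t], words[0], n_words) for t in tags}]
    back = [{}]  # back[i][t]: previous tag on that best path
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev = max(tags, key=lambda p: best[i - 1][p]
                       + log_prob(trans[p], t, n_tags))
            best[i][t] = (best[i - 1][prev]
                          + log_prob(trans[prev], t, n_tags)
                          + log_prob(emit[t], words[i], n_words))
            back[i][t] = prev
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]

model = train_hmm(TRAIN)
print(viterbi(["the", "cat", "runs"], model))  # -> ['DET', 'N', 'V']
```

In this sketch an unseen word receives the same smoothed emission probability under every tag, so its tag is decided by the transition probabilities alone. The hybrid design described in the abstract instead lets a regexp tagger handle some suffixes and unknown words.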
