Abstract:
Natural language processing (NLP) allows computers to understand and process human language. It plays a basic role in research tasks such as part-of-speech tagging (POST), spelling correction, parsing, machine translation, grammar checking, and text summarization. Among these, POST is a foundation for other NLP tasks because it is used as a preprocessing component. The task of POST is to label each word in a sentence with its corresponding part-of-speech category.
Several part-of-speech taggers have been developed for local and foreign languages. However, these POS taggers cannot be used directly for other languages. To the best of the researcher's knowledge, no part-of-speech tagger has been developed for the Guragigna language. The aim of this study is therefore to develop a part-of-speech tagger for Guragigna. To do this, literature related to this work was first reviewed to understand the nature and behavior of the language and to identify possible tags. As a result, a tagset of 17 tags was identified. To train and evaluate the tagger, a corpus of 6,745 words was collected, mainly from Guragigna fiction and editorial texts.
To develop the tagger, a Hidden Markov Model (HMM) approach and a hybrid approach, which combines rule-based and HMM-based tagging, are used. Initially, raw Guragigna text is tagged by the HMM tagger, which selects the most probable tag path for a given sequence of words. After that, a rule-based tagger corrects the output of the HMM tagger using a predefined set of rules. The Viterbi algorithm is used for HMM decoding. Additionally, our experiments include a conditional random field (CRF) approach.
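The hybrid pipeline described above can be illustrated with a minimal sketch. This is not the implementation used in the study: the English toy words, the tag names (`DET`, `N`, `V`, `NUM`), the add-one smoothing, and the single digit rule in `apply_rules` are all assumptions chosen only to show how Viterbi decoding and a rule-based correction step fit together.

```python
import math
from collections import Counter, defaultdict

START = "<s>"  # hypothetical sentence-start marker

def train_hmm(tagged_sentences):
    """Count tag transitions and word emissions from (word, tag) sentences."""
    trans = defaultdict(Counter)  # trans[previous_tag][tag]
    emit = defaultdict(Counter)   # emit[tag][word]
    vocab = set()
    for sentence in tagged_sentences:
        prev = START
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
    return trans, emit, vocab

def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability (smoothing choice is an assumption)."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))

def viterbi(words, trans, emit, vocab):
    """Return the most probable tag sequence for a non-empty word list."""
    tags = list(emit)
    n_tags = len(tags)
    n_words = len(vocab) + 1  # one extra slot for unknown words
    best = [{t: log_prob(trans[START], t, n_tags)
                + log_prob(emit[t], words[0], n_words) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            # Best previous tag for reaching tag t at position i.
            prev = max(tags, key=lambda p: best[i - 1][p]
                       + log_prob(trans[p], t, n_tags))
            best[i][t] = (best[i - 1][prev]
                          + log_prob(trans[prev], t, n_tags)
                          + log_prob(emit[t], words[i], n_words))
            back[i][t] = prev
    # Follow back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]

def apply_rules(words, tags):
    """Hypothetical correction rule: tokens made only of digits become NUM."""
    return ["NUM" if w.isdigit() else t for w, t in zip(words, tags)]

# Toy training data standing in for a tagged Guragigna corpus.
toy_corpus = [
    [("the", "DET"), ("dog", "N"), ("runs", "V")],
    [("a", "DET"), ("cat", "N"), ("sleeps", "V")],
]
trans, emit, vocab = train_hmm(toy_corpus)
sentence = ["the", "cat", "runs"]
hmm_tags = viterbi(sentence, trans, emit, vocab)
hybrid_tags = apply_rules(sentence, hmm_tags)
```

In this sketch the rule step only overrides HMM output where a rule fires, so the hybrid tagger never differs from the HMM tagger on tokens that no rule covers.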
For the experiments, we used 90% of the data for training and the remaining 10% for testing. Separate experiments were conducted for each tagger. Tested on the same data, the performance of the taggers is 66.56, 74.46, and 78.42 for the CRF, HMM, and hybrid taggers, respectively.
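A 90/10 evaluation of this kind can be sketched as follows. The abstract does not state how the split was made or which metric was reported, so the sentence-level, unshuffled split and the token-level accuracy below are assumptions, and `split_corpus` and `accuracy` are hypothetical helper names.

```python
def split_corpus(sentences, train_fraction=0.9):
    """Split a list of sentences into training and test parts, in order."""
    cut = int(len(sentences) * train_fraction)
    return sentences[:cut], sentences[cut:]

def accuracy(gold_tags, predicted_tags):
    """Token-level accuracy in percent over parallel lists of tag sequences."""
    total = 0
    correct = 0
    for gold_sentence, predicted_sentence in zip(gold_tags, predicted_tags):
        for gold, predicted in zip(gold_sentence, predicted_sentence):
            total += 1
            correct += int(gold == predicted)
    return 100.0 * correct / total if total else 0.0

# Toy illustration with placeholder sentence ids and tags.
train_part, test_part = split_corpus(list(range(10)))
score = accuracy([["N", "V"], ["DET", "N"]], [["N", "N"], ["DET", "N"]])
```

Evaluating every tagger on the same held-out part, as the study does, keeps the reported scores directly comparable.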
The size of the training data influences the results of the tagger. Our experiments show that adding the rule-based tagger gives better results than the HMM tagger alone.