dc.description.abstract |
Named Entity Recognition (NER) is an important task in many areas of natural language processing. It is the task of detecting and classifying Named Entities (NEs), which are unique identifiers of proper nouns, times and measurements. NER has many applications, including information retrieval and extraction systems, automatic summarization, question-answering systems, machine translation and others. This thesis presents an Amharic NER system built with a statistical approach. The experiments were conducted on a total of 18,191 tokens covering training, development and test data, following the CoNLL2002 format and the BIO tagging scheme; the LingPipe library was used as the tool for developing the NER system for the Amharic language. We conducted five different experiments by varying the feature sets and tested the performance of the proposed Conditional Random Field (CRF) based Amharic Named Entity Recognition (ANER) system. The highest performance achieved in this work is 78.24%, obtained by extending the baseline features so that the maximum length of the previous and next words' prefixes and suffixes increases from 2 to 4; the lowest is 68.29%, obtained in the experiment without Part of Speech (POS) tags of tokens. Previous work on ANER achieved 74.61%, 80% and 85.9% using different tools, algorithms, corpus sizes and feature sets. However, none of those systems handles the miscellaneous (MISC) entity type. Previous researchers focused only on the detection and classification of person, organization and location names using different algorithms. This work adds new tasks: the detection and classification of time and measurement expressions. For these reasons we could not compare this work directly with the previous works, since it is the first attempt to detect and classify time and measurement NE types in the Amharic language. 
Based on the experimental results, we conclude that the tool and algorithm may have an impact on the performance of ANER systems, and that POS tags of tokens combined with prefixes and suffixes of length four are important features for recognizing NEs. Further study is needed to explain why these features are important for recognizing NEs. |
en_US |