Abstract:
As the volume of unstructured and semi-structured documents in electronic media continues to grow, tools that can extract relevant data to support decision making are becoming crucial. Information Extraction (IE) is concerned with automatically extracting facts from text and storing them in a database so that the data is easy to use and manage. As the first research work on IE from Afan Oromo text, we designed a model that deals with the infrastructure news domain in the Oromo language. The proposed model has three main components: document preprocessing, learning and extraction, and post-processing. The preprocessing component is responsible for tokenizing and parsing news texts. The learning and extraction component extracts candidate texts from the news text and learns a classification model that is used to predict the category of each candidate text. The post-processing component is responsible for formatting the extracted data.
In this work, recall, precision, and F-measure are used as evaluation metrics for Afan Oromo Text Information Extraction (AOTIE). Trained and tested on a dataset of 3169 tokens, AOTIE achieved 79.5% precision, 80.5% recall, and an F-measure of 80%. These results serve as the baseline for further experiments on AOTIE. We set up two main experimental scenarios. In the first scenario, a gazetteer is developed. The second scenario is aimed at observing the influence of Afan Oromo grammatical structure. Both scenarios showed that the performance of AOTIE depends mostly on the grammatical structure of Afan Oromo.
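For readers unfamiliar with the evaluation metrics, the following is a minimal sketch of how precision, recall, and F-measure are commonly computed. It assumes the balanced F-measure (the harmonic mean of precision and recall), which the abstract does not state explicitly but which is consistent with the reported figures. The function names and the example counts are illustrative and do not come from the AOTIE implementation.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall (assumed variant)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f(true_pos: int, false_pos: int, false_neg: int):
    """Compute (precision, recall, F-measure) from extraction counts.

    true_pos:  extracted items that are correct
    false_pos: extracted items that are incorrect
    false_neg: correct items that the system failed to extract
    """
    extracted = true_pos + false_pos
    relevant = true_pos + false_neg
    precision = true_pos / extracted if extracted else 0.0
    recall = true_pos / relevant if relevant else 0.0
    return precision, recall, f_measure(precision, recall)


# Reported AOTIE scores: 79.5% precision and 80.5% recall.
reported_f = f_measure(0.795, 0.805)
print(f"F-measure: {reported_f:.1%}")

# Hypothetical counts, for illustration only.
p, r, f = precision_recall_f(true_pos=8, false_pos=2, false_neg=2)
```

Under this assumption, the reported precision and recall combine to an F-measure that rounds to the 80% given above.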