Abstract:
A thesaurus is a reference work of words or of information about a particular field or set of concepts; in particular, it is a book of words and their synonyms, or a list of subject headings or descriptors, usually with a cross-reference system, used to organize a collection of documents for reference and retrieval. One of the major problems of modern information retrieval systems is the vocabulary problem: the discrepancy between the terms used to describe documents and the terms searchers use to describe their information need, which leads to information overload or information mismatch. One way of handling the vocabulary problem is to use a thesaurus, which shows the relationships between terms, together with query expansion, which supplies alternative terms for a query in order to improve retrieval effectiveness. Because manual thesaurus construction is labor-intensive, and therefore expensive to build and hard to update in a timely manner, an automatic thesaurus for Afan Oromo is implemented using the term-clustering approach. In this research, 36,869 words selected from the collected documents are used, and they are suggested as a means of improving the expansion process and retrieving more relevant documents for the user's query.
The experimental results are encouraging: the system achieves an accuracy of 56.6% on Afan Oromo documents, and 73.11% of the terms in the collection are registered as similar. The main challenge is the complexity of Afan Oromo, which leads to under-stemming or over-stemming when the documents are not properly preprocessed. The performance and accuracy of the system improve when the documents are properly preprocessed, and the approach is more effective on large collections spanning multiple domains. Cluster quality is measured using intra-cluster and inter-cluster techniques, and the resulting value is 1.33.
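
As a rough illustration of the general idea of a term-clustering thesaurus used for query expansion, a minimal sketch follows. It is not the implementation used in this research. It assumes that terms are related when they occur in the same documents, that relatedness is scored with cosine similarity over term-document counts, and that a fixed similarity threshold decides which terms are linked. The function names (build_term_vectors, cosine, build_thesaurus, expand_query), the threshold value, and the toy placeholder documents are all hypothetical, and no stemming or other preprocessing is performed.

```python
import math


def build_term_vectors(docs):
    """Map each term to a {doc_id: count} vector (whitespace tokenization only)."""
    vectors = {}
    for doc_id, doc in enumerate(docs):
        for term in doc.lower().split():
            counts = vectors.setdefault(term, {})
            counts[doc_id] = counts.get(doc_id, 0) + 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {doc_id: count} vectors."""
    dot = sum(weight * v[doc_id] for doc_id, weight in u.items() if doc_id in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_thesaurus(docs, threshold=0.6):
    """Link every pair of terms whose similarity reaches the (assumed) threshold."""
    vectors = build_term_vectors(docs)
    terms = sorted(vectors)
    thesaurus = {term: [] for term in terms}
    for i, a in enumerate(terms):
        for b in terms[i + 1:]:
            sim = cosine(vectors[a], vectors[b])
            if sim >= threshold:
                thesaurus[a].append((b, sim))
                thesaurus[b].append((a, sim))
    # Most similar terms first; ties broken alphabetically.
    for term in thesaurus:
        thesaurus[term].sort(key=lambda pair: (-pair[1], pair[0]))
    return thesaurus


def expand_query(query, thesaurus, top_k=2):
    """Append up to top_k related terms for each query term, without duplicates."""
    expanded = []
    for term in query.lower().split():
        if term not in expanded:
            expanded.append(term)
        for related, _ in thesaurus.get(term, [])[:top_k]:
            if related not in expanded:
                expanded.append(related)
    return expanded


# Toy placeholder documents (not Afan Oromo data).
toy_docs = ["alpha beta", "alpha beta gamma", "gamma delta"]
toy_thesaurus = build_thesaurus(toy_docs, threshold=0.6)
print(expand_query("alpha", toy_thesaurus))
```

The pairwise comparison in this sketch is quadratic in the vocabulary size, so a collection with tens of thousands of terms would need a more economical clustering strategy; the sketch is meant only to make the thesaurus-lookup and expansion steps concrete.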