Abstract:
The main objective of natural language processing is to enable computers to perform tasks that would otherwise require human involvement, saving the labor, cost, and time devoted to such tasks. These goals are pursued through tasks such as text classification, speech recognition, and information retrieval. In text classification, however, accuracy decreases and computational complexity increases as the number of categories increases. The aim of this study is to explore and design a dimensionality reduction scheme for Amharic document classification using feature selection and feature extraction.
To achieve this objective, to design an effective model, and to establish the state of the art, the relevant literature was reviewed. The designed dimension reduction scheme combines information gain, chi-square, and document frequency as feature selection methods with local thresholding, followed by Principal Component Analysis (PCA) to further refine the selected features. NetBeans 8.1 was used for pre-processing and Python for designing the artifact model. The new dimension reduction scheme was evaluated on Amharic news documents and achieved 82.77% accuracy. It was also compared with another dimensionality reduction system and with feature merging strategies. The new scheme reduces the number of features produced by information gain, chi-square, and document frequency by 64.07%, 74%, and 50.63% respectively, and the training time increases by only 20 seconds as the number of categories increases from three to thirteen.
Although the proposed dimension reduction scheme slows the growth of computational time, classification accuracy still declines, at a diminishing rate, as the feature set is reduced to lower the computational cost. There is therefore a need to apply genetic algorithms to the selected features, since they decide which features to remove based on the resulting classification accuracy.
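As an illustration of the feature selection stage only, the following is a minimal Python sketch of chi-square scoring with local thresholding. It assumes that local thresholding means keeping the top k terms for each category and merging the per-category selections, which may differ from the exact policy used in this study. The function names, the toy English tokens, and the value of k are hypothetical, and the PCA refinement step is not shown.

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 term/category contingency table.

    n11: documents in the category that contain the term
    n10: documents outside the category that contain the term
    n01: documents in the category that lack the term
    n00: documents outside the category that lack the term
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0:
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def select_features_locally(docs, labels, k):
    """Keep the k best-scoring terms per category, then merge the selections.

    docs is a list of token sets, labels the matching list of categories.
    Ties are broken alphabetically so the result is deterministic.
    """
    vocab = sorted({term for doc in docs for term in doc})
    selected = set()
    for category in sorted(set(labels)):
        scores = {}
        for term in vocab:
            n11 = n10 = n01 = n00 = 0
            for doc, label in zip(docs, labels):
                has_term = term in doc
                in_category = label == category
                if has_term and in_category:
                    n11 += 1
                elif has_term:
                    n10 += 1
                elif in_category:
                    n01 += 1
                else:
                    n00 += 1
            scores[term] = chi_square(n11, n10, n01, n00)
        ranked = sorted(vocab, key=lambda t: (-scores[t], t))
        selected.update(ranked[:k])
    return sorted(selected)


# Hypothetical toy corpus: placeholder tokens, not real Amharic news data.
docs = [
    {"goal", "team"}, {"goal", "coach"},
    {"vote", "party"}, {"vote", "team"},
    {"clinic", "doctor"}, {"clinic", "team"},
]
labels = ["sport", "sport", "politics", "politics", "health", "health"]

features = select_features_locally(docs, labels, k=1)
print(features)  # ['clinic', 'goal', 'vote']
```

In this toy corpus the term "team" appears in every category, so its chi-square score is zero for each of them and it is never selected. Information gain or document frequency could be substituted for the scoring function without changing the per-category selection logic.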