Abstract:
Machine translation is a field of computational linguistics that uses computer software to translate text automatically from one natural language into another. This study investigates the use of unsupervised morphological segmentation in statistical machine translation (SMT) from Amharic into Tigrigna.
The experiments were conducted using 14,231 parallel sentences collected from parliamentary documents, educational manuals, health documents and the Bible. The parallel data was split randomly into a training set (90%) and a combined testing and tuning set (10%). A Tigrigna monolingual corpus of 25,875 sentences was collected from DWOT, the Tigray Mass Media Agency (TMMA) and online Tigrigna texts in order to build the language model, which promotes fluency in the target-language output. The open-source Moses statistical machine translation system was used for training, tuning and decoding. The parallel data was aligned using the GIZA++ toolkit, the language model was built with SRILM, and Morfessor was used to segment the data.
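The random 90%/10% split described above can be illustrated with a minimal sketch. It is not the procedure actually used in the study: the function name `split_parallel`, the fixed seed and the placeholder sentences are assumptions made for illustration, and because the abstract does not state how the 10% portion is divided between tuning and testing, the sketch leaves it as a single held-out set.

```python
import random


def split_parallel(src_lines, tgt_lines, train_ratio=0.9, seed=0):
    """Randomly split a parallel corpus while keeping sentence pairs aligned.

    Hypothetical helper: the name, the default ratio and the seed are
    illustrative assumptions, not taken from the study.
    """
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of sentences")
    # Pair the sentences first so that shuffling cannot break the alignment.
    pairs = list(zip(src_lines, tgt_lines))
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(pairs)
    cut = int(len(pairs) * train_ratio)
    return pairs[:cut], pairs[cut:]


# Placeholder data with the same size as the corpus reported in the abstract.
amharic = [f"am sentence {i}" for i in range(14231)]
tigrigna = [f"ti sentence {i}" for i in range(14231)]

train, heldout = split_parallel(amharic, tigrigna)
print(len(train), len(heldout))
```

Shuffling the sentence pairs rather than the two files separately is the important design choice here, since word alignment and phrase extraction rely on each line of one file corresponding to the same line of the other.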
The first experiment, the baseline translation system, achieved a BLEU score of 35.24%; a score above 30 is generally taken to indicate understandable translation. The second experiment applied unsupervised segmentation to both the Amharic and the Tigrigna data used in the baseline system, including the data for the Tigrigna language model. This yielded a 2.74% increase in BLEU score over the baseline. From this we conclude that morpheme-based units lead to better performance in an Amharic-Tigrigna translation system. Finally, we recommend that future work focus on supervised segmentation with additional data.