dc.description.abstract |
The ultimate goal of an automatic speech recognition (ASR) system is to automatically convert a speech utterance into a sequence of text. Numerous experimental studies have been conducted on ASR for technologically favored languages. Research on Amharic speech recognition began in 2001, when Solomon [7] developed an isolated CV-syllable recognition system for Amharic. The authors of [8, 9, 10, 11, 13, 14, 15, 16, 17] attempted to build Amharic ASR systems from domain-dependent spontaneous and read speech corpora using the HMM approach, which relies heavily on several individual components trained independently of each other. Over the past decade, considerable research has also been conducted on end-to-end ASR systems for technologically favored languages; however, only a few studies have addressed end-to-end Amharic ASR. Recently, Hailu et al. [115], Emiru et al. [145], Solomon et al. [2], and Yonas [3] have conducted ASR research applying various end-to-end approaches. Solomon et al. [2] and Yonas [3] addressed the resource scarcity issue using multilingual end-to-end and transfer learning approaches, respectively. This research can be considered an extension of the work of Solomon et al. [2] and Yonas [3], with the objective of enhancing the recognition performance of Amharic LVCSR by applying raw audio data augmentation techniques to address the resource scarcity issue.
This study empirically explores the effect of raw audio data augmentation on the recognition performance of an Amharic LVCSR system developed with the end-to-end listen, attend and spell (LAS) approach. We developed a baseline model and an aggregate augmented model, trained on the original and the aggregate augmented training datasets, respectively, and compared their performance using the character error rate (CER). Both models were tested on a set of 3,296 utterances partitioned from a corpus containing 100 hours of read speech. The aggregate augmented training dataset was produced by applying the time stretching, pitch shifting, and noise adding raw audio data augmentation techniques to the original training dataset. Neither lexicons nor language models were used in developing the two models. We report a CER of 12.35% for the baseline model and 8.23% for the aggregate augmented model. The experimental results show that the aggregate augmented model outperforms the baseline, achieving an absolute CER reduction of 4.12%.
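The three augmentation operations named above can be sketched in plain NumPy. This is a minimal illustration, not the authors' implementation: the frame and hop sizes, the SNR parameter, and the overlap-add stretching strategy are all assumptions made for the sketch.

```python
import numpy as np

def time_stretch(x, rate, frame=2048, hop=512):
    """Naive overlap-add time stretch: scales duration by 1/rate, keeps pitch."""
    win = np.hanning(frame)
    out_len = int(len(x) / rate)
    out = np.zeros(out_len + frame)
    norm = np.zeros(out_len + frame)
    n_frames = max(1, int((len(x) - frame) / (hop * rate)) + 1)
    for i in range(n_frames):
        a = int(i * hop * rate)          # analysis position in the input
        s = i * hop                      # synthesis position in the output
        out[s:s + frame] += x[a:a + frame] * win
        norm[s:s + frame] += win
    norm[norm < 1e-8] = 1.0              # avoid division by zero at the edges
    return out[:out_len] / norm[:out_len]

def resample(x, factor):
    """Linear-interpolation resampling: scales pitch by factor, duration by 1/factor."""
    idx = np.arange(0, len(x) - 1, factor)
    return np.interp(idx, np.arange(len(x)), x)

def pitch_shift(x, semitones):
    """Shift pitch by resampling, then stretch back to roughly the original length."""
    factor = 2.0 ** (semitones / 12.0)
    resampled = resample(x, factor)      # pitch * factor, length / factor
    return time_stretch(resampled, len(resampled) / len(x))

def add_noise(x, snr_db, seed=0):
    """Add white Gaussian noise scaled to a chosen signal-to-noise ratio (dB)."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(len(x))
    scale = np.sqrt(np.mean(x ** 2) / (np.mean(noise ** 2) * 10.0 ** (snr_db / 10.0)))
    return x + scale * noise
```

Each transform maps one clean waveform to one perturbed copy; pooling the outputs of all three with the original recordings yields an aggregate augmented training set along the lines described in the abstract.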
Keywords: ASR, Amharic LVCSR, LAS, raw audio data augmentation techniques. |
en_US |