dc.description.abstract |
Text-to-Speech (TTS) translation is the process of artificially generating speech from text, with applications such as telephone services, reading electronic documents aloud, and speech aids for people with disabilities. Many text-to-speech translation models exist for languages such as English, Afan Oromo, Tigrigna, and Welaytta, but research on the Amharic language is scarce, and existing studies have notable limitations. In this work, speech is generated from natural-language text using deep learning approaches. Written text in a language contains both standard words (SWs) and non-standard words (NSWs) such as numbers, abbreviations, currency amounts, and dates. NSWs cannot be handled by simply applying letter-to-sound rules. Previous work converted text to speech using rule-based approaches and Hidden Markov Models (HMMs). The main problem with HMM-based synthesis is that certain features used for speech synthesis are hand-coded by humans, and these are not necessarily the best features for synthesizing speech. To address this problem, we used long short-term memory (LSTM) and bidirectional LSTM (BiLSTM) deep learning approaches, since deep learning can learn complex patterns in data and synthesize speech without hand-coded features. The LSTM model achieved MCD, MSE, and MAE scores of 0.2961, 0.0940, and 0.2474, respectively, while the BiLSTM model achieved 0.2910, 0.0916, and 0.2400. On these objective measures, the BiLSTM model performs better. As a second, subjective evaluation, the mean opinion score (MOS) was used to measure intelligibility and naturalness, yielding scores of 4.14 and 3.93, respectively.
Keywords: Text-to-Speech translation, deep learning, long short-term memory (LSTM), bidirectional LSTM (BiLSTM). |
en_US |