Abstract:
Speech synthesis is the conversion of a given text into the corresponding spoken narration. Existing research on Amharic speech synthesis is limited and needs further investigation to adapt the language to the digital world. Amharic also lacks a fiction narrator system that reads fiction text and produces speech waveforms. To address this gap, we have designed an Amharic fiction narrator system using a deep learning approach, since deep learning has achieved significant progress in image processing, machine translation, speech recognition, and speech synthesis by modeling complex internal data structures. The fiction audio was collected from YouTube channels and split into segments with an average duration of 7 seconds, paired with their corresponding text; in total, 1,253 utterances were prepared to train the model. A CNN is used for context-independent feature extraction from characters. The extracted features are passed through a bidirectional LSTM network that encodes context-dependent text features into hidden-state representations, and a Bahdanau attention mechanism provides the alignment between the encoder and decoder networks; in short, attention aligns each character grapheme with its acoustic unit. LSTM layers in the decoder predict spectrogram frames of the acoustic units from the encoded hidden states; that is, the decoder estimates acoustic properties such as pitch and duration of characters and represents them as spectrogram images. The spectrograms are generated with the short-time Fourier transform using Hanning and Hamming window functions to smooth discontinuities in the signal, and the two windows are compared to select the better one for the Amharic fiction narrator system. Finally, a Griffin-Lim synthesizer with the inverse short-time Fourier transform estimates phases and reconstructs waveforms from the magnitude spectrograms.
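The final two stages of the pipeline (windowed STFT spectrograms and Griffin-Lim waveform reconstruction) can be sketched as follows. This is a minimal illustration using `scipy.signal`, not the paper's implementation: the sample rate, FFT size, iteration count, and the 1 kHz test tone are all assumptions chosen for the example.

```python
import numpy as np
from scipy.signal import stft, istft

# Illustrative parameters (assumptions, not from the paper):
# 16 kHz audio, 512-point FFT frames with 50% overlap.
SR, NFFT = 16000, 512

def magnitude_spectrogram(x, window):
    # STFT magnitude with the chosen window ("hann" or "hamming");
    # the window tapers frame edges to smooth discontinuities.
    _, _, Z = stft(x, fs=SR, window=window, nperseg=NFFT)
    return np.abs(Z)

def griffin_lim(mag, window, length, n_iter=50):
    # Griffin-Lim: recover the phase discarded by the magnitude
    # spectrogram by alternating inverse/forward STFTs, keeping the
    # newly estimated phase and restoring the known magnitude.
    rng = np.random.default_rng(0)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))
    for _ in range(n_iter):
        _, x = istft(mag * phase, fs=SR, window=window, nperseg=NFFT)
        # Fix the length so the frame count stays constant across iterations.
        x = np.pad(x, (0, max(0, length - len(x))))[:length]
        _, _, Z = stft(x, fs=SR, window=window, nperseg=NFFT)
        phase = np.exp(1j * np.angle(Z))
    _, x = istft(mag * phase, fs=SR, window=window, nperseg=NFFT)
    return np.pad(x, (0, max(0, length - len(x))))[:length]

# Stand-in signal: a 1 kHz tone instead of a narrated-speech segment.
t = np.arange(SR) / SR
audio = np.sin(2 * np.pi * 1000 * t)

mag_hann = magnitude_spectrogram(audio, "hann")
recon = griffin_lim(mag_hann, "hann", len(audio))
```

Swapping `"hann"` for `"hamming"` in both calls reproduces the windowing comparison: the two windows differ only in their taper coefficients, which changes spectral leakage in the magnitude spectrograms the decoder is trained to predict.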
Using the Hanning window function, we obtained MOS scores of 3.6 and 3.4 for speech intelligibility and naturalness, respectively; with the Hamming window function, the corresponding scores were 3.5 and 3.4. The main challenges of the study were the uneven distribution of characters in the corpus, since some characters are rarely used in the language, and the need for powerful computing resources to train on speech waveforms.