Abstract:
In Ethiopia, many languages are spoken. Among them, Amharic is the working language of the Federal Government and is spoken by a large part of the Ethiopian population. As a result, a large and rapidly growing volume of Amharic speech data is generated from television, telephone, radio, lectures, meetings, and the internet. However, retrieving the right information from such large audio archives within a short period of time is a challenge, and speaker diarization plays a great role in addressing it. Speaker diarization is the process of segmenting or annotating given speech data based on speaker identity. Many studies have been conducted on speaker diarization for different languages, but there is no prior work on speaker diarization for the Amharic language. This study focuses on developing a speaker diarization model for Amharic using a deep learning approach.
The proposed model has three components: preprocessing, feature extraction, and speaker classification. In preprocessing, we perform voice activity detection and spectrogram generation on the speech data. Voice activity detection separates speech from non-speech in the input Amharic audio. In feature extraction, we propose combining Mel-Frequency Cepstral Coefficient (MFCC) features with Convolutional Neural Network (CNN) features. For speaker classification, we use a support vector machine (SVM) with a Radial Basis Function (RBF) kernel.
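The fusion-and-classification step described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: random vectors stand in for the real MFCC features and CNN embeddings, and the feature dimensions (13 and 64), speaker count, and class-separation offset are all assumed for demonstration. Only the feature-level concatenation and the RBF-kernel SVM mirror the described design.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical stand-in features: in the actual model these would be
# MFCC vectors and CNN embeddings extracted from spectrograms of
# speech segments; here random vectors illustrate the fusion step.
rng = np.random.default_rng(0)
n_per_speaker, n_speakers = 50, 4
mfcc_feats = rng.normal(size=(n_per_speaker * n_speakers, 13))  # MFCC-like
cnn_feats = rng.normal(size=(n_per_speaker * n_speakers, 64))   # CNN-embedding-like
labels = np.repeat(np.arange(n_speakers), n_per_speaker)
# Shift each synthetic speaker's embedding so the classes are separable.
cnn_feats += labels[:, None] * 2.0

# Feature-level fusion: concatenate MFCC and CNN features per segment.
X = np.concatenate([mfcc_feats, cnn_feats], axis=1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, labels, test_size=0.25, random_state=0, stratify=labels
)
clf = SVC(kernel="rbf", gamma="scale")  # RBF-kernel SVM classifier
clf.fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
```

Concatenating the two feature sets lets the SVM see both the hand-crafted spectral description (MFCC) and the learned representation (CNN) of each segment, which is the combination the results below attribute the accuracy gain to.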
The proposed model is implemented in Python using Keras (with TensorFlow as a backend) and evaluated on a test dataset. The model achieved a classification accuracy of 99.8% on the training data and 98.60% on the test data for annotating speech based on speaker identity. Our model was faster to train than the end-to-end CNN model and pretrained CNN models such as AlexNet and LeNet. In addition, combining MFCC and CNN features improved the model's performance by 4% over the end-to-end CNN, 9% over AlexNet, and 8% over LeNet.