BDU IR

AMHARIC SPEAKER DIARIZATION SYSTEM USING DEEP LEARNING APPROACH


dc.contributor.author ABEBE, FENTA MIHIRET
dc.date.accessioned 2021-09-22T12:04:02Z
dc.date.available 2021-09-22T12:04:02Z
dc.date.issued 2021-06
dc.identifier.uri http://ir.bdu.edu.et/handle/123456789/12637
dc.description.abstract In Ethiopia, a number of languages are spoken. Among these, Amharic is the working language of the Federal Government and is spoken by a large part of the Ethiopian population. As a result, a large quantity of Amharic speech data is generated from television, telephone, radio, lectures, meetings, and the internet, and it grows rapidly over time. It is obvious that time is money and that information is power, but the challenge is retrieving the right information within a short period of time. Speaker diarization therefore plays a great role in retrieving the required information quickly from large archives containing huge amounts of audio data. Speaker diarization is the process of segmenting or annotating given speech data based on the speakers' identities. Many studies have been conducted on speaker diarization for different languages, but no work exists on speaker diarization for the Amharic language. This study focused on developing a speaker diarization model for Amharic using a deep learning approach. The proposed model has three components: preprocessing, feature extraction, and speaker classification. In preprocessing, we performed voice activity detection and spectrogram generation on the speech data. Voice activity detection separates speech from non-speech in the input Amharic audio data. For feature extraction, we propose combining Mel-Frequency Cepstral Coefficient (MFCC) and Convolutional Neural Network (CNN) features. For speaker classification, we use a support vector machine (SVM) with a Radial Basis Function (RBF) kernel. The proposed model is implemented using Keras (with TensorFlow as a backend) in the Python programming language and evaluated on a test dataset. Accordingly, the model achieved a classification accuracy of 99.8% on training data and 98.60% on test data when annotating given speech data based on speaker identity.
Our model was faster to train than the end-to-end CNN model and other pretrained CNN models such as AlexNet and LeNet. In addition, combining MFCC features with CNN features improved the performance of the model by 4% (end-to-end CNN), 9% (AlexNet), and 8% (LeNet). en_US
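The abstract's preprocessing step uses voice activity detection to separate speech from non-speech frames. The thesis does not specify the VAD method, so the following is only a minimal illustrative sketch of a naive short-time-energy VAD on a synthetic signal; the frame length, hop size, and threshold are assumed values, not those of the thesis.

```python
import numpy as np

def energy_vad(signal, frame_len=400, hop=160, threshold_ratio=0.1):
    """Naive energy-based VAD: flag frames whose short-time energy
    exceeds a fraction of the peak frame energy as speech.
    (Illustrative stand-in; the thesis does not specify its VAD method.)"""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    energies = np.array([np.sum(f ** 2) for f in frames])
    return energies > threshold_ratio * energies.max()

# A 1-second synthetic signal at 16 kHz: silence, a 440 Hz tone, silence.
sr = 16000
t = np.arange(sr) / sr
sig = np.where((t > 0.3) & (t < 0.7), np.sin(2 * np.pi * 440 * t), 0.0)
mask = energy_vad(sig)
```

Frames falling inside the tone segment are marked as speech; the silent leading and trailing frames are not.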
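The classification stage described in the abstract uses an SVM with an RBF kernel over the fused MFCC + CNN feature vectors. As a hedged sketch of that stage only, the snippet below trains scikit-learn's `SVC(kernel="rbf")` on synthetic Gaussian clusters standing in for per-speaker feature vectors; the speaker count, feature dimension, and data are invented for illustration.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Synthetic stand-in for fused MFCC + CNN feature vectors: one Gaussian
# cluster per speaker (real features would come from the audio pipeline).
rng = np.random.default_rng(0)
n_speakers, per_speaker, dim = 4, 50, 32
X = np.vstack([rng.normal(loc=i, scale=0.5, size=(per_speaker, dim))
               for i in range(n_speakers)])
y = np.repeat(np.arange(n_speakers), per_speaker)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# RBF-kernel SVM as the speaker classifier, as named in the abstract.
clf = SVC(kernel="rbf", gamma="scale")
clf.fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
```

Each test segment is assigned a speaker label, which is the per-segment decision a diarization system aggregates into speaker-annotated regions.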
dc.language.iso en_US en_US
dc.subject INFORMATION TECHNOLOGY en_US
dc.title AMHARIC SPEAKER DIARIZATION SYSTEM USING DEEP LEARNING APPROACH en_US
dc.type Thesis en_US

