BDU IR

SPECTROGRAM IMAGE AND MEL-FREQUENCY CEPSTRAL COEFFICIENT ASSISTED CONVOLUTIONAL NEURAL NETWORK FOR TEXT INDEPENDENT AMHARIC LANGUAGE SPEAKER IDENTIFICATION

dc.contributor.author Mulusew, Fentahun Getu
dc.date.accessioned 2022-03-18T06:35:51Z
dc.date.available 2022-03-18T06:35:51Z
dc.date.issued 2021-11-09
dc.identifier.uri http://ir.bdu.edu.et/handle/123456789/13211
dc.description.abstract Speech is a natural way of conveying information between a speaker and a listener. Speaker identification is the process of determining who is speaking based on his or her unique voiceprint features. Speaker identification systems have been studied for different languages using traditional Mel Frequency Cepstral Coefficients, Gaussian Mixture Models, i-vector methods, and fusion techniques. However, such models are limited by their dependence on hand-crafted feature engineering, their processing time, their susceptibility to noise, and their poor performance on short utterances. Moreover, because every natural language has its own particular characteristics, an identical speaker identification model cannot be used across different languages. In this thesis, an end-to-end Convolutional Neural Network (CNN) and a combined CNN with a support vector machine (SVM) were used for text-independent Amharic language speaker identification. Speech signals were collected from thirty individual speakers of both genders. The dataset comprises a total of 1500 speech samples, each 10 seconds long; 1200 samples were used for training and 300 for testing. After collection and pre-processing, each speech sample is transformed into a spectrogram image using digital signal processing techniques. The resized spectrogram images are then used as input to the proposed model, which learns and extracts speaker-specific spectral features. The model achieved 94.4 % accuracy with the end-to-end CNN and 98.8 % accuracy with the CNN combined with an SVM. Finally, to evaluate the proposed model, we compared it with a pre-trained AlexNet model on the same dataset, which reached 80 % accuracy. Thus, using a CNN for feature extraction and an SVM for classification improves both accuracy and training time relative to the end-to-end CNN and the pre-trained AlexNet model. Keywords: CNN, MFCC, STFT, SVM, AlexNet, Spectrogram, Speaker Identification, Amharic Language en_US
dc.language.iso en_US en_US
dc.subject INFORMATION TECHNOLOGY en_US
dc.title SPECTROGRAM IMAGE AND MEL-FREQUENCY CEPSTRAL COEFFICIENT ASSISTED CONVOLUTIONAL NEURAL NETWORK FOR TEXT INDEPENDENT AMHARIC LANGUAGE SPEAKER IDENTIFICATION en_US
dc.type Thesis en_US
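
The abstract states that each speech sample is transformed into a spectrogram image before being passed to the CNN, and lists STFT among its keywords. The record does not give the thesis's actual parameters or tooling, so the following is only a minimal sketch of a log-magnitude STFT spectrogram. The frame length, hop size, Hann window, sample rate, and the synthetic test tone are all illustrative assumptions, not values from the thesis.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n (assumed window choice)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def stft_log_spectrogram(signal, frame_len=64, hop=32):
    """Return a log-magnitude spectrogram as a list of frames.

    Each frame holds frame_len // 2 + 1 values in dB, covering the
    non-negative frequency bins only. frame_len and hop are
    illustrative defaults, not parameters from the thesis.
    """
    window = hann(frame_len)
    bins = frame_len // 2 + 1
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = [signal[start + i] * window[i] for i in range(frame_len)]
        row = []
        for k in range(bins):
            # Naive DFT of the windowed frame at bin k (fine for a tiny demo).
            acc = sum(
                chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len)
            )
            # Small offset avoids log10(0) on silent bins.
            row.append(20 * math.log10(abs(acc) + 1e-10))
        frames.append(row)
    return frames


# Synthetic stand-in for a speech clip: a 1 kHz tone sampled at 8 kHz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(512)]
spec = stft_log_spectrogram(tone)

# Index of the strongest frequency bin in the first frame.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print(len(spec), len(spec[0]), peak_bin)
```

In a real pipeline the resulting frame-by-frequency grid would be rendered and resized as an image before being fed to the network; a library FFT would replace the naive DFT used here for self-containment.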