Abstract:
Speaker recognition is the task of identifying a speaker from acquired voice samples. Speaker recognition systems have been studied for different languages using a variety of techniques, e.g. Mel-Frequency Cepstral Coefficients (MFCC), Gaussian Mixture Models, i-vector methods, vector quantization, and fusion techniques. However, such models are limited by their dependence on hand-crafted feature engineering, poor scalability as the dataset grows, susceptibility to noise, difficulty handling mimicry, and weak performance on short utterances. Moreover, because every natural language has its own particular characteristics, an identical speaker recognition model cannot be used across different languages. In this thesis, an ensemble of vector quantization (VQ) and a Convolutional Neural Network (CNN) is used for text-independent Amharic-language speaker identification. For our recognition model, speech signals were collected from 200 individual speakers of both genders. In total, 2,000 speech samples were collected, each 10 seconds long. To build our model, 80% of the speech samples were used for training and 20% for testing. After collection and pre-processing, each speech sample is transformed into a spectrogram image; the CNN then extracts spectral features from the resized spectrogram images, while VQ extracts features from the framed voice signal. The CNN and VQ features are ensembled together and fed to the classifier. We also evaluated an end-to-end CNN, VQ alone, and the CNN-VQ combination. In our experiments, the ensemble feature vector of CNN and VQ with a VQ classifier achieved 97.23% accuracy, whereas the end-to-end CNN achieved 89.61%, CNN features with a VQ classifier achieved 74.87%, and the ensemble CNN-VQ features with a CNN classifier achieved 91.45%. Finally, to evaluate the performance of our proposed model, we compared it with a pre-trained AlexNet model on our dataset, which achieved 90.19% accuracy. Thus, using CNN and VQ for feature extraction and VQ for classification enhances the accuracy of speaker identification.
Keywords: CNN, signal processing, VQ, spectrogram