Abstract:
Speaker recognition is the task of identifying a speaker from acquired voice samples. Speaker recognition systems have been studied for different languages using a variety of techniques, e.g. Mel-Frequency Cepstral Coefficients (MFCC), Gaussian Mixture Models, i-vector methods, vector quantization, and fusion techniques. However, such models are limited by their dependence on hand-crafted feature engineering, poor scalability as the dataset grows, susceptibility to noise, difficulty handling mimicry, and weak performance on short utterances. Moreover, because every natural language has its own particular characteristics, an identical speaker recognition model cannot be used across different languages. In this thesis, an ensemble of vector quantization (VQ) and a Convolutional Neural Network (CNN) is used for text-independent Amharic-language speaker identification. For our recognition model, speech signals were collected from 200 individual speakers of both genders. In total, 2,000 speech samples were collected, each 10 seconds long. To build our model, 80% of the speech samples were used for training and 20% for testing. After collection and pre-processing, each speech sample is transformed into a spectrogram image; the CNN then extracts spectral features from the resized spectrogram images, while VQ extracts features from the framed voice signal. The CNN and VQ features are ensembled together and fed to the classifier. We also evaluated an end-to-end CNN, VQ alone, and the CNN-VQ combination. In our experiments, the ensemble feature vector of CNN and VQ with a VQ classifier achieved 97.23% accuracy, whereas the end-to-end CNN achieved 89.61%, CNN features with a VQ classifier achieved 74.87%, and the ensemble CNN-VQ features with a CNN classifier achieved 91.45%. Finally, to evaluate the performance of our proposed model, we compared it with a pre-trained AlexNet model on our dataset, which achieved 90.19% accuracy. Thus, using CNN and VQ for feature extraction and VQ for classification enhances the accuracy of speaker identification.
Keywords: CNN, signal processing, VQ, spectrogram