dc.description.abstract |
Speech is a natural way of conveying information between a speaker and a listener. Speaker
identification is the process of determining who is speaking based on his or her unique voiceprint
features. Studies on speaker identification systems have been conducted for different languages using
traditional Mel Frequency Cepstral Coefficients, Gaussian Mixture Models, i-vector methods, and
fusion techniques. However, such models are limited by their dependence on hand-crafted feature
engineering, processing time, susceptibility to noise, and poor performance on short
utterances. Moreover, because every natural language has its own particular characteristics, it is
impossible to use an identical speaker identification model for different languages. In this thesis,
an end-to-end Convolutional Neural Network and a combined convolutional neural network with
a support vector machine were used for text-independent Amharic language
speaker identification. Speech signals were collected from thirty
individual speakers of both genders. In total, 1500 speech
samples were collected, each 10 seconds long. To build our model, we
used 1200 speech samples for training and 300 speech samples for testing. After being collected and
pre-processed, each speech sample is transformed into a spectrogram image using digital signal
processing techniques. Then, the resized spectrogram images are used as input to the proposed
model to learn and extract speaker-specific spectral features. Our models achieved 94.4% and 98.8%
accuracy for the end-to-end Convolutional Neural Network and the convolutional neural network
with a support vector machine, respectively. Finally, to evaluate the performance of our
proposed model, we compared it with a pre-trained AlexNet model on our dataset.
The pre-trained end-to-end AlexNet model achieved 80% accuracy. Thus, using a
Convolutional Neural Network for feature extraction and a support vector machine for
classification improves both the accuracy and the training time relative to the end-to-end
Convolutional Neural Network and the pre-trained AlexNet model.
Keywords: CNN, MFCC, STFT, SVM, AlexNet, Spectrogram, Speaker Identification,
Amharic Language |
en_US |