Abstract:
Spoken language identification is the task of deciding which language a speaker is speaking. It is used as a front-end process in human-computer interaction, speech-to-text translation, speech-to-speech translation, and automatic routing of callers to the intended operator. Many studies on spoken language identification have used Gaussian mixture model (GMM), i-vector, and neural network approaches. However, GMM and i-vector approaches are not robust in noisy environments, and although deep neural networks perform better on short utterances, they are computationally expensive. To overcome these problems, we propose a noise-resistant spoken language identification model for four Ethiopian languages: Amharic, Tigrigna, Oromo, and Somali. For the dataset, we used noisy recordings of meetings, discussions, conferences, and reports. Because back-propagation neural networks are slow to train, we propose models based on a feed-forward neural network (FFNN) and on convolutional neural networks (CNNs). The first model uses acoustic features with an FFNN classifier; we compared five acoustic features and obtained the best accuracy, 88%, with delta Mel-frequency cepstral coefficients (MFCCs). In the second approach, we used an end-to-end CNN and a CNN combined with a support vector machine (SVM). We obtained an accuracy of 98% with the end-to-end CNN and 97% with the CNN-SVM. Thus, the SVM can reduce the training time of the CNN without significantly degrading accuracy.
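The best-performing acoustic feature in the first model is the delta MFCC. As a minimal sketch (not the paper's implementation), delta coefficients are commonly computed from the static MFCC matrix with a regression over a small symmetric window; the window half-width `N = 2` below is an assumed typical value, not taken from the abstract.

```python
import numpy as np

def delta(features, N=2):
    """Compute delta (first-order regression) coefficients.

    features: array of shape (num_frames, num_ceps), e.g. static MFCCs.
    N: regression window half-width (N=2 is a common choice, assumed here).
    Returns an array of the same shape with one delta per static coefficient.
    """
    features = np.asarray(features, dtype=float)
    # Standard regression denominator: 2 * sum_{n=1..N} n^2
    denom = 2 * sum(n * n for n in range(1, N + 1))
    # Replicate edge frames so the window is defined at the boundaries.
    padded = np.pad(features, ((N, N), (0, 0)), mode="edge")
    out = np.empty_like(features)
    for t in range(len(features)):
        # d_t = sum_n n * (c_{t+n} - c_{t-n}) / denom
        out[t] = sum(
            n * (padded[t + N + n] - padded[t + N - n])
            for n in range(1, N + 1)
        ) / denom
    return out
```

For a linearly increasing coefficient track, the interior delta values equal the slope, which is a quick sanity check; in practice the resulting delta matrix would be concatenated with (or used in place of) the static MFCCs before feeding the FFNN classifier.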