Abstract:
In Ethiopia, more than 200 dialects are spoken across 83 languages, including Afan
Oromo, Amharic, and Tigrigna, which belong to the largest ethnic and linguistic groups (Wimsatt &
Wynn, 2011). According to Wimsatt & Wynn (2011), Amharic has over 30 million native
speakers in Ethiopia and 62 million speakers worldwide. It has five known dialect categories
spoken in different parts of Ethiopia: the Addis Ababa, Gojjam, Gondar,
Wollo, and Shewa dialects.
For the purpose of this study, only the dialects spoken in the Amhara region, namely
Gojjam, Wollo, Shewa, and Gondar, are considered. Few research attempts have been made to develop
models that classify Amharic dialect categories. Moreover, most of these studies
focused on identifying Amharic dialects using audio data recorded in a controlled environment
that is relatively free from noise. In addition, the methods used in the existing research have
their own drawbacks for classification. The purpose of this study is therefore to explore the
possibility of developing a dialect classification model using audio data recorded in an
uncontrolled environment that contains background noise, using other machine learning
techniques.
In this study, an attempt is made to develop an Amharic dialect classification model using CNN and
CNN-SVM techniques. Spontaneous speech data were collected from the Amhara Media Corporation
(AMECO) archive system. Since the data were recorded in different parts of the Amhara region by
camera operators, they are uncontrolled and contain background noise. The data are stored in the
AMECO archive system as raw recordings. For each Amharic dialect category, 300 speech recordings
were collected, giving a total of 1,200 utterances spoken by people living in the Amhara region. Because
the data contain background noise and other irrelevant content, preprocessing operations were
performed to remove the different types of noise and silence from the audio signal. Silence was
removed by applying a thresholding technique, and background noise was removed by applying a
moving average filter, which is a low-pass filter.
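The two preprocessing steps above can be sketched as follows; the threshold value, frame length, and window size are illustrative assumptions for this sketch, not the parameters used in the study.

```python
import numpy as np

def remove_silence(signal, threshold=0.02, frame_len=512):
    """Drop frames whose mean absolute amplitude falls below a threshold.

    `threshold` and `frame_len` are illustrative values, not the
    study's actual parameters.
    """
    frames = [signal[i:i + frame_len] for i in range(0, len(signal), frame_len)]
    voiced = [f for f in frames if np.mean(np.abs(f)) >= threshold]
    return np.concatenate(voiced) if voiced else np.array([])

def moving_average(signal, window=5):
    """Moving average filter: a simple low-pass filter that smooths
    high-frequency background noise by averaging neighboring samples."""
    kernel = np.ones(window) / window
    return np.convolve(signal, kernel, mode="same")
```

In practice the threshold would be tuned per recording (or derived from the noise floor), since the AMECO recordings vary in loudness and background conditions.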
Audio features were extracted in the form of spectrograms and used for model development. Our
experiments showed that the CNN model achieved an accuracy of 85% with the ReLU
activation function and 79% with the tanh activation function. The accuracies obtained with the two
techniques (CNN and CNN-SVM) were compared, and CNN alone
achieved better classification accuracy. Our CNN-based Amharic dialect identification model was also
compared with state-of-the-art models and showed better recognition performance on the
datasets used.
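The spectrogram extraction step can be illustrated with a short-time FFT; this is a simplified stand-in for the mel spectrogram features named in the keywords, with the mel filter bank step omitted and the frame size and hop length chosen only for illustration.

```python
import numpy as np

def spectrogram(signal, n_fft=512, hop=256):
    """Log-magnitude spectrogram via a Hann-windowed short-time FFT.

    Sketch only: the study uses mel spectrograms, which would additionally
    apply a mel filter bank to each magnitude frame.
    """
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = signal[start:start + n_fft] * window
        frames.append(np.abs(np.fft.rfft(frame)))
    spec = np.array(frames).T        # shape: (n_fft // 2 + 1, n_frames)
    return np.log(spec + 1e-10)      # log compression before feeding the CNN
```

The resulting 2-D array is treated as an image, which is what makes a CNN a natural choice for the classifier.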
In general, training deep learning algorithms with more data increases the accuracy of the
recognition model. Therefore, it is better to use more data and additional speech preprocessing
operations to further improve the accuracy of the dialect identification model. It is also
recommended to build a robust system that handles the background noise in data collected from
uncontrolled environments in order to enhance performance.
KEYWORDS: Amharic dialect identification, CNN, CNN-SVM, mel spectrogram, spontaneous
speech, acoustic feature, confusion matrix, low-pass filter, state-of-the-art models, uncontrolled data