Abstract:
In Ethiopia, more than 200 dialects are spoken across 83 languages, including Afan
Oromo, Amharic, and Tigrigna, which belong to the largest ethnic and linguistic groups (Wimsatt &
Wynn, 2011). According to Wimsatt & Wynn (2011), Amharic has over 30 million native
speakers in Ethiopia and 62 million speakers worldwide. It has five known dialect categories
spoken in different parts of Ethiopia: the Addis Ababa, Gojjam, Gondar,
Wollo, and Shewa dialects.
For the purpose of this study, only the dialects spoken in the Amhara region, namely
Gojjam, Wollo, Shewa, and Gondar, are considered. Few research attempts have been made to develop
models that classify Amharic dialect categories. Moreover, most of these studies
focused on identifying Amharic dialects using audio data recorded in a controlled environment
that is relatively free from noise. In addition, the methods used in the existing research have
their own drawbacks for classification. The purpose of this study is therefore to explore the
possibility of developing a dialect classification model using audio data recorded in an
uncontrolled environment that contains background noise, using other machine learning
techniques.
In this study, an attempt is made to develop an Amharic dialect classification model using CNN and
CNN-SVM techniques. Spontaneous speech data were collected from the Amhara Media Corporation
(AMECO) archive system. Since the data were recorded in different parts of the Amhara region by
camera operators, they are uncontrolled and contain background noise. The data are stored in the
AMECO archive system as raw recordings. For each Amharic dialect category, 300 speech recordings
were collected, giving a total of 1,200 utterances spoken by people living in the Amhara region. Because
the data contain background noise and other irrelevant content, preprocessing operations were
performed to remove the different types of noise and silence from the audio signal. Silence was
removed by applying a thresholding technique, and background noise was removed by applying a
moving average filter, which is a low-pass filter.
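The two preprocessing steps above can be sketched as follows; the threshold value, frame length, and window size are illustrative assumptions for this sketch, not the parameters used in the study.

```python
import numpy as np

def remove_silence(signal, threshold=0.02, frame_len=512):
    """Drop frames whose mean absolute amplitude falls below a threshold.

    `threshold` and `frame_len` are illustrative values, not the
    study's actual parameters.
    """
    frames = [signal[i:i + frame_len] for i in range(0, len(signal), frame_len)]
    voiced = [f for f in frames if np.mean(np.abs(f)) >= threshold]
    return np.concatenate(voiced) if voiced else np.array([])

def moving_average(signal, window=5):
    """Moving average filter: a simple low-pass filter that smooths
    high-frequency background noise by averaging neighboring samples."""
    kernel = np.ones(window) / window
    return np.convolve(signal, kernel, mode="same")
```

In practice the threshold would be tuned per recording (or derived from the noise floor), since the AMECO recordings vary in loudness and background conditions.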
Audio features were extracted in the form of spectrograms and used for model development. Our
experiments showed that the CNN model achieved an accuracy of 85% with the ReLU
activation function and 79% with the tanh activation function. The accuracies obtained with the two
techniques (CNN and CNN-SVM) were compared, and CNN alone
achieved better classification accuracy. Our CNN-based Amharic dialect identification model was also
compared with state-of-the-art models and showed better recognition performance on the
datasets used.
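The spectrogram extraction step can be illustrated with a short-time FFT; this is a simplified stand-in for the mel spectrogram features named in the keywords, with the mel filter bank step omitted and the frame size and hop length chosen only for illustration.

```python
import numpy as np

def spectrogram(signal, n_fft=512, hop=256):
    """Log-magnitude spectrogram via a Hann-windowed short-time FFT.

    Sketch only: the study uses mel spectrograms, which would additionally
    apply a mel filter bank to each magnitude frame.
    """
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = signal[start:start + n_fft] * window
        frames.append(np.abs(np.fft.rfft(frame)))
    spec = np.array(frames).T        # shape: (n_fft // 2 + 1, n_frames)
    return np.log(spec + 1e-10)      # log compression before feeding the CNN
```

The resulting 2-D array is treated as an image, which is what makes a CNN a natural choice for the classifier.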
In general, training deep learning algorithms with more data increases the accuracy of the
recognition model. Therefore, it is better to use more data and additional speech preprocessing
operations to further improve the accuracy of the dialect identification model. It is also
recommended to build a robust system that handles the background noise in data collected from
uncontrolled environments in order to enhance performance.
KEYWORDS: Amharic dialect identification, CNN, CNN-SVM, mel spectrogram, spontaneous
speech, acoustic feature, confusion matrix, low-pass filter, state-of-the-art models, uncontrolled data