Abstract:
Speech emotion identification plays a vital role in human-computer interaction and human-to-human communication systems, where accurately recognizing emotions from speech signals can enhance user experience and engagement. Developing effective emotion recognition systems for under-resourced languages such as Amharic presents unique challenges. This thesis proposes an approach to Amharic speech emotion identification that combines spectrogram analysis with local feature-assisted convolutional neural networks (CNNs). The method uses spectrogram representations to capture spectral and temporal characteristics and integrates local features (MFCC, chroma, ZCR, energy, and pitch) to enrich the representation of emotional cues. A dataset of 1,650 three-second Amharic speech recordings annotated with five emotions (anger, fear, happiness, neutral, sadness) was used. Enhanced preprocessing with spectral subtraction and wavelet transformation improved model accuracy to 90% and reduced training time to 12 minutes and 11 seconds, demonstrating the effectiveness of advanced noise reduction. Combining local and spectrogram features yielded the highest accuracy, 90%, surpassing the 73% and 79% obtained with each feature type alone. Using these combined features, the CNN model achieved 90% accuracy with balanced precision, recall, and F1-scores of 0.90, outperforming LSTM (58.48%), BiLSTM (63.33%), and GRU (40%) models, all of which exhibited overfitting. These results highlight the CNN model's superior performance and robustness, advancing emotion recognition technology for under-resourced languages and improving human-computer and human-human interactions. Future research could extend this approach to other languages and incorporate additional emotional cues to further improve accuracy. The study's applicability remains constrained by the dataset's size and diversity.
Keywords: Deep Learning, Speech Emotion Identification, CNN, Spectrogram, LSTM, BiLSTM, GRU