Abstract:
Facial expression and affective speech are the two most common channels for expressing human emotion. Speech emotion recognition (SER) is a technology that extracts emotional features from speech signals and analyzes the emotional changes they convey. In this work, we focus on SER because speech can express a person's internal feelings, and SER remains an active and widely studied research area. In SER, relying on either auditory features (F0, MFCC, LPCC, energy, zero-crossing rate, etc.) or spectrogram-based features alone has its own limitations: auditory features encode human knowledge, whereas spectrogram-based features provide a more general representation. To improve SER performance, we therefore combine spectrogram-based and auditory features to discriminate one emotion from another. For this purpose, we collected an Amharic speech corpus from a hospital and from residential settings. The main challenges in SER are identifying good features together with suitable feature extraction and classification approaches, handling similar content expressed in different emotions, and obtaining datasets recorded in real environments. This research aims to recognize emotion in the Amharic language, which differs from other languages in that slackening and tightening of a single emotional signal can change its meaning; in our experiments, this characteristic causes confusion between the Neutral and Angry emotions. In this research, we apply hybrid noise filtering techniques (spectral subtraction and MMSE) to remove noise from the speech, use a CNN-BiLSTM to extract deep features from spectrogram images, and use handcrafted feature extraction methods to obtain the auditory features. After extracting the deep features, we apply PCA to reduce the dimensionality of the extracted feature vectors.
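The fusion-and-reduction step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the handcrafted features are limited to per-frame energy and zero-crossing rate pooled over the utterance, the CNN-BiLSTM spectrogram embeddings are stood in for by random vectors, and the frame length, hop size, and PCA dimensionality are assumed values.

```python
import numpy as np
from sklearn.decomposition import PCA

def zero_crossing_rate(frame):
    # Fraction of sign changes between consecutive samples.
    return np.mean(np.abs(np.diff(np.sign(frame))) > 0)

def short_time_energy(frame):
    # Mean squared amplitude of the frame.
    return np.sum(frame ** 2) / len(frame)

def handcrafted_features(signal, frame_len=400, hop=200):
    # Frame the signal, compute per-frame energy and ZCR, then pool
    # with mean and std to get one utterance-level vector.
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    zcr = np.array([zero_crossing_rate(f) for f in frames])
    eng = np.array([short_time_energy(f) for f in frames])
    return np.array([zcr.mean(), zcr.std(), eng.mean(), eng.std()])

rng = np.random.default_rng(0)
# Stand-in utterances (1 s at 16 kHz); the real input would be the
# denoised Amharic speech after spectral subtraction and MMSE filtering.
utterances = [rng.standard_normal(16000) for _ in range(20)]
auditory = np.stack([handcrafted_features(u) for u in utterances])

# Stand-in for the CNN-BiLSTM spectrogram embeddings (assumed 128-D).
deep = rng.standard_normal((20, 128))

combined = np.hstack([auditory, deep])                   # fuse both feature views
reduced = PCA(n_components=10).fit_transform(combined)   # dimensionality reduction
print(reduced.shape)
```

The pooled handcrafted vector and the deep embedding are simply concatenated before PCA, which is the straightforward way to realize the "combined features" the abstract describes.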
Finally, the combined features are fed to an SVM classifier with an RBF kernel to label the emotional classes Sadness, Anger, Happiness, and Neutral from a dataset of 1042 utterances. The hybrid model achieves a recognition accuracy of 96.75%.
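The final classification step can be sketched with scikit-learn's RBF-kernel SVM. The feature vectors here are synthetic (one well-separated Gaussian cluster per emotion class) and the C and gamma settings are assumptions, so the reported accuracy of the toy example says nothing about the paper's 96.75% result; it only shows the shape of the classification stage.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
EMOTIONS = ["Sadness", "Anger", "Happiness", "Neutral"]

# Synthetic 10-D feature vectors (as if they came out of the PCA step):
# one Gaussian cluster per emotion, offset so the classes are separable.
X = np.vstack([rng.standard_normal((60, 10)) + 3 * i for i in range(4)])
y = np.repeat(EMOTIONS, 60)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# SVM with an RBF kernel, as used in the paper; C and gamma are assumed.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
print(acc)
```

With clearly separated toy clusters the classifier recovers the four labels almost perfectly; on real fused speech features, C and gamma would normally be tuned by cross-validation.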