Abstract:
Speech emotion recognition (SER) predicts emotion from speech data irrespective of semantic content. However, variability in speech signals makes emotion extraction a challenging task. Much research on SER has relied on prosodic and temporal features. Used independently, prosodic, temporal, and linguistic features of speech do not yield sufficiently accurate results. Mel-frequency cepstral coefficients (MFCC), modulation spectral (MS) features, and deep learning features have also been considered in previous studies, but these techniques have limitations related to deep-spectrum features. Since both unsupervised and semi-supervised deep spectral and prosodic features contain emotion information, combining spectral and prosodic features is expected to improve the performance of an emotion recognition system. Therefore, the main objective of this study is to investigate a vector quantization model assisted by spectrogram images and spectral coefficients for Amharic speech emotion recognition, avoiding the general representation of traditional spectrum coefficients and spectrogram-based features. To this end, we collected Amharic speech from naïve speakers using a smartphone and from https://github.com/Ethio2021/ASED_V1. Because the dataset was collected in both controlled and uncontrolled environments, the recordings from naïve speakers contained environmental noise. A single technique may not be suitable for removing such noise, so we applied a sequential combination of discrete Fourier transform (DFT), spectral subtraction, and Wiener filter preprocessing techniques. We also examined the effects of spectrogram image interpolation techniques on emotion recognition in the Amharic language.
For feature extraction, we applied GAN, MFCC, and LPCC. For classification, we applied vector quantization (VQ) and achieved 92.07% accuracy.
Keywords: DFT, GAN, LPCC, MFCC, Spectral Subtraction, VQ, Wiener Filter.
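
To illustrate the classification step named above, the following is a minimal sketch of a VQ classifier: one codebook is trained per emotion class, and an utterance is assigned to the class whose codebook gives the lowest average distortion. It is not the configuration used in this study. The k-means codebook training, the codebook size, the 13-dimensional synthetic features standing in for MFCC/LPCC frames, and the labels "angry" and "sad" are all assumptions made for illustration.

```python
import numpy as np


def train_codebook(frames, n_codewords=4, n_iter=20, seed=0):
    """Fit a codebook to (n_frames, n_dims) features with plain k-means.

    k-means is an assumption here; the abstract does not state how
    the codebooks are trained.
    """
    rng = np.random.default_rng(seed)
    # Initialise the codewords from randomly chosen distinct frames.
    idx = rng.choice(len(frames), n_codewords, replace=False)
    codebook = frames[idx].copy()
    for _ in range(n_iter):
        # Assign every frame to its nearest codeword (Euclidean distance).
        dists = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=2)
        nearest = dists.argmin(axis=1)
        # Move each codeword to the mean of the frames assigned to it.
        for k in range(n_codewords):
            members = frames[nearest == k]
            if len(members):  # leave a codeword unchanged if its cell is empty
                codebook[k] = members.mean(axis=0)
    return codebook


def distortion(frames, codebook):
    """Average distance from each frame to its nearest codeword."""
    dists = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=2)
    return dists.min(axis=1).mean()


def classify(frames, codebooks):
    """Return the label whose codebook quantizes the frames with least distortion."""
    return min(codebooks, key=lambda label: distortion(frames, codebooks[label]))


# Synthetic stand-in for per-frame features (hypothetical data and labels).
rng = np.random.default_rng(0)
train = {
    "angry": rng.normal(loc=3.0, scale=0.5, size=(200, 13)),
    "sad": rng.normal(loc=-3.0, scale=0.5, size=(200, 13)),
}
codebooks = {label: train_codebook(x) for label, x in train.items()}

# An unseen utterance drawn from the same distribution as the "angry" class.
test_utterance = rng.normal(loc=3.0, scale=0.5, size=(50, 13))
predicted = classify(test_utterance, codebooks)
print(predicted)
```

With real data, each row of `frames` would be a feature vector extracted from one short frame of a denoised utterance, and the codebook size would be tuned on held-out recordings.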