dc.description.abstract |
Speech is a natural and ubiquitous means of communication among human beings. This has motivated researchers to treat speech signals as a relevant and effective medium for facilitating human-human and human-computer interaction. However, practical real-time systems face complex challenges in constructing a recognition system: determining the emotion conveyed by a given utterance is difficult in real-time environments, since different emotions are often separated by only small distances in feature space. Speech emotion recognition (SER) is challenging for three main reasons: human emotions are abstract and difficult to differentiate; emotions may surface only at specific moments within a long utterance; and speech data with emotional labels is scarce, while spectrogram-based features provide only a general representation that does not incorporate human knowledge. To address these limitations and improve SER performance, this study integrates local and global aware features with spectrogram- and auditory-based features to differentiate emotions. A real-world dataset was prepared from speech recorded with a smartphone in real environments at a supreme court and a cafe. To enhance quality, hybrid noise filtering techniques (spectral subtraction and minimum mean square error, MMSE) are used to remove noise. MFCCs are used for handcrafted feature extraction, while a capsule network extracts deep features from spectrogram images. The two feature sets are then combined, and the performance is evaluated using an SVM classifier for recognizing emotions. The average classification accuracy achieved in this thesis is 87%.
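As a rough illustration of the pipeline described above (not the thesis implementation), the following Python sketch strings together spectral-subtraction denoising, mean-pooled MFCCs as handcrafted features, a stand-in for the CapsNet spectrogram embedding, and an SVM trained on the fused vectors. The use of librosa and scikit-learn, the file names, the labels, and the log-mel stand-in are all assumptions, and the MMSE stage of the hybrid filter is omitted.

import numpy as np
import librosa
from sklearn.svm import SVC

def spectral_subtraction(y, sr, noise_dur=0.5):
    # Simple magnitude spectral subtraction, assuming the first
    # noise_dur seconds of the recording are noise-only. The MMSE
    # stage of the hybrid filter is omitted for brevity.
    stft = librosa.stft(y)  # n_fft=2048, hop_length=512 (defaults)
    n = max(1, int(noise_dur * sr / 512))
    noise_mag = np.abs(stft[:, :n]).mean(axis=1, keepdims=True)
    mag = np.maximum(np.abs(stft) - noise_mag, 0.0)
    return librosa.istft(mag * np.exp(1j * np.angle(stft)))

def mfcc_features(path, n_mfcc=13):
    # Mean-pooled MFCCs as the handcrafted feature vector.
    y, sr = librosa.load(path, sr=16000)
    y = spectral_subtraction(y, sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).mean(axis=1)

def deep_features(path, dim=16):
    # Stand-in for the capsule-network embedding of the spectrogram
    # (the trained CapsNet is not reproduced here); mean-pooled log-mel
    # bands merely keep the sketch runnable end to end.
    y, sr = librosa.load(path, sr=16000)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=dim)
    return librosa.power_to_db(mel).mean(axis=1)

def fused_vector(path):
    # Early fusion: concatenate handcrafted and deep features.
    return np.concatenate([mfcc_features(path), deep_features(path)])

# Hypothetical file names and emotion labels.
paths = ["clip_angry.wav", "clip_happy.wav", "clip_neutral.wav"]
labels = ["angry", "happy", "neutral"]
X = np.stack([fused_vector(p) for p in paths])
clf = SVC(kernel="rbf").fit(X, labels)
print(clf.predict(X[:1]))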
Keywords: Spectrogram image, MFCC, speech emotion recognition, feature extraction, CapsNet |
en_US |