Abstract:
Visual speech recognition, commonly known as lip reading, is the task of interpreting a speaker's words from the movement of his or her mouth. People must rely on it in many settings: when the audio track of a video has been intentionally or unintentionally deleted or distorted, when footage comes from surveillance cameras (most security cameras recording from a distance capture no audio, or the audio is unusable), or when the listener is hearing impaired, the speech can only be understood by watching the speaker's lip movements. Because visual speech recognition is very difficult for a human to perform, it should be automated. Researchers around the world have studied automatic visual speech recognition for various languages. For Amharic, previous work addressed visual speech recognition only at the word and digit level, and it considered only frontal video data. In this study we present phrase-level visual speech recognition for Amharic, one of Ethiopia's most widely spoken languages, using deep learning.
We collected our own dataset of sample phrases from several Amharic speakers. Since a video is a sequence of consecutive images, the video data is represented as a sequence of image frames. The collected videos are preprocessed: each video is converted into frames, the frames pass through a series of image preprocessing steps, and the resulting images are assembled and labeled. Features are then extracted from the preprocessed data and passed to a classifier. For preprocessing we use OpenCV (Open Source Computer Vision), Matplotlib, the Viola-Jones detector (to locate the face and the lips), and other Python tools.
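A minimal sketch of this preprocessing step, assuming the standard OpenCV frontal-face Haar cascade and approximating the lip region as the lower third of the detected face box, is shown below; the output image size and file layout are illustrative only, not the exact pipeline used in the study.

```python
# Sketch of frame extraction and lip cropping (assumptions: Haar face cascade,
# lip region = lower third of the face box, 64x64 grayscale output frames).
import cv2
import os

FACE_CASCADE = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def extract_lip_frames(video_path, out_dir, n_frames=22, size=(64, 64)):
    """Read one phrase video, detect the face (Viola-Jones), crop an
    approximate lip region, and save n_frames grayscale images."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Sample n_frames evenly across the whole video.
    indices = [int(i * total / n_frames) for i in range(n_frames)]
    saved = 0
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            continue
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = FACE_CASCADE.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        if len(faces) == 0:
            continue
        x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # keep the largest face
        lip = gray[y + 2 * h // 3: y + h, x: x + w]          # lower third of the face box
        lip = cv2.resize(lip, size)
        cv2.imwrite(os.path.join(out_dir, "frame_%02d.png" % saved), lip)
        saved += 1
    cap.release()
    return saved
```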
A Convolutional Neural Network (CNN) is used to extract features. We extract 22 frames per phrase from each video, and 70% of the dataset is used for training while 30% is used for testing. For classification we use a Recurrent Neural Network (RNN), specifically a Bidirectional Long Short-Term Memory (BiLSTM) network, and achieve an accuracy of 92%.
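The overall CNN-feature-extraction plus BiLSTM-classification pipeline can be sketched as follows in Keras. The layer sizes, input resolution, and number of phrase classes are illustrative assumptions, not the exact architecture of the study; only the structure, a per-frame CNN followed by a BiLSTM over the 22-frame sequence, follows the description above.

```python
# Sketch of a CNN + BiLSTM classifier for 22-frame lip sequences.
# NUM_CLASSES, the 64x64 input size, and all layer widths are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_FRAMES, HEIGHT, WIDTH = 22, 64, 64   # 22 frames per phrase
NUM_CLASSES = 10                          # assumed number of Amharic phrases

# Per-frame CNN feature extractor, applied to every frame via TimeDistributed.
frame_cnn = models.Sequential([
    layers.Conv2D(32, 3, activation="relu", input_shape=(HEIGHT, WIDTH, 1)),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
])

model = models.Sequential([
    layers.TimeDistributed(frame_cnn, input_shape=(NUM_FRAMES, HEIGHT, WIDTH, 1)),
    layers.Bidirectional(layers.LSTM(128)),   # BiLSTM over the frame sequence
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# Training would use a 70/30 train/test split of the phrase videos, e.g. with
# sklearn.model_selection.train_test_split(X, y, test_size=0.3).
```

Wrapping the CNN in TimeDistributed applies the same feature extractor to each of the 22 frames, and the BiLSTM reads the resulting feature sequence in both directions before the softmax layer predicts the phrase.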
Keywords: Amharic; Lip-reading; Deep Learning; Convolutional Neural Network (CNN); Recurrent Neural Network (RNN)