Abstract:
Visual speech recognition, commonly known as lip reading, is the task of interpreting a speaker's words from the movement of his or her mouth. People must rely on it in many settings: when the audio track of a video has been intentionally or unintentionally deleted or distorted, when footage comes from surveillance cameras (most security cameras recording from a distance capture no audio, or the audio is unusable), or when the listener is hearing impaired, the speech can only be understood by watching the speaker's lip movements. Because visual speech recognition is very difficult for a human to perform, it should be automated. Researchers around the world have studied automatic visual speech recognition for various languages. For Amharic, previous work addressed visual speech recognition only at the word and digit level, and it considered only frontal video data. In this study we present phrase-level visual speech recognition for Amharic, one of Ethiopia's most widely spoken languages, using deep learning.
We collected our own dataset of sample phrases from several Amharic speakers. Since a video is a sequence of consecutive images, the video data is represented as a sequence of image frames. The collected videos are preprocessed: each video is converted into frames, the frames pass through a series of image preprocessing steps, and the resulting images are assembled and labeled. Features are then extracted from the preprocessed data and passed to a classifier. For preprocessing we use OpenCV (Open Source Computer Vision), Matplotlib, the Viola-Jones detector (to locate the face and the lips), and other Python tools.
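A minimal sketch of this preprocessing step, assuming the standard OpenCV frontal-face Haar cascade and approximating the lip region as the lower third of the detected face box, is shown below; the output image size and file layout are illustrative only, not the exact pipeline used in the study.

```python
# Sketch of frame extraction and lip cropping (assumptions: Haar face cascade,
# lip region = lower third of the face box, 64x64 grayscale output frames).
import cv2
import os

FACE_CASCADE = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def extract_lip_frames(video_path, out_dir, n_frames=22, size=(64, 64)):
    """Read one phrase video, detect the face (Viola-Jones), crop an
    approximate lip region, and save n_frames grayscale images."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Sample n_frames evenly across the whole video.
    indices = [int(i * total / n_frames) for i in range(n_frames)]
    saved = 0
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            continue
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = FACE_CASCADE.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        if len(faces) == 0:
            continue
        x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # keep the largest face
        lip = gray[y + 2 * h // 3: y + h, x: x + w]          # lower third of the face box
        lip = cv2.resize(lip, size)
        cv2.imwrite(os.path.join(out_dir, "frame_%02d.png" % saved), lip)
        saved += 1
    cap.release()
    return saved
```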
A Convolutional Neural Network (CNN) is used to extract features. We extract 22 frames per phrase from each video, and 70% of the dataset is used for training while 30% is used for testing. For classification we use a Recurrent Neural Network (RNN), specifically a Bidirectional Long Short-Term Memory (BiLSTM) network, and achieve an accuracy of 92%.
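The overall CNN-feature-extraction plus BiLSTM-classification pipeline can be sketched as follows in Keras. The layer sizes, input resolution, and number of phrase classes are illustrative assumptions, not the exact architecture of the study; only the structure, a per-frame CNN followed by a BiLSTM over the 22-frame sequence, follows the description above.

```python
# Sketch of a CNN + BiLSTM classifier for 22-frame lip sequences.
# NUM_CLASSES, the 64x64 input size, and all layer widths are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_FRAMES, HEIGHT, WIDTH = 22, 64, 64   # 22 frames per phrase
NUM_CLASSES = 10                          # assumed number of Amharic phrases

# Per-frame CNN feature extractor, applied to every frame via TimeDistributed.
frame_cnn = models.Sequential([
    layers.Conv2D(32, 3, activation="relu", input_shape=(HEIGHT, WIDTH, 1)),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
])

model = models.Sequential([
    layers.TimeDistributed(frame_cnn, input_shape=(NUM_FRAMES, HEIGHT, WIDTH, 1)),
    layers.Bidirectional(layers.LSTM(128)),   # BiLSTM over the frame sequence
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# Training would use a 70/30 train/test split of the phrase videos, e.g. with
# sklearn.model_selection.train_test_split(X, y, test_size=0.3).
```

Wrapping the CNN in TimeDistributed applies the same feature extractor to each of the 22 frames, and the BiLSTM reads the resulting feature sequence in both directions before the softmax layer predicts the phrase.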
Keywords: Amharic; Lip-reading; Deep Learning; Convolutional Neural Network (CNN); Recurrent Neural Network (RNN)