
Designing phrase-level visual speech recognition for Amharic language using Deep learning approach


dc.contributor.author Dessalew, Getnet
dc.date.accessioned 2021-09-21T12:36:40Z
dc.date.available 2021-09-21T12:36:40Z
dc.date.issued 2021-07
dc.identifier.uri http://ir.bdu.edu.et/handle/123456789/12623
dc.description.abstract Visual speech recognition, or lip reading, is the process of understanding a speaker's speech by looking at the movement of the speaker's mouth. Human beings struggle to understand what a speaker is saying in many situations: when video data is corrupted and its sound (audio data) has been deleted or distorted, intentionally or unintentionally; when video is captured by surveillance cameras (security camera videos are usually captured from a distance and often have no audio, or the audio is unusable); and when hearing-impaired people, who cannot hear the voice, must follow the speech. In all such cases, the speech must be understood by watching the movement of the speaker's lips. Understanding speech from lip movement alone is a very difficult task for human beings; research shows that even a trained person can recognize only about 20% of a speech by watching the speaker's lips. Since visual speech recognition is so difficult for humans, it should be automated. Researchers around the world have studied automatic visual speech recognition for different languages. Previous visual speech recognition systems for Amharic were proposed at the word and digit level, and they considered video taken only from the front. In this study, we propose phrase-level visual speech recognition for the Amharic language using a deep learning approach, and we consider video captured from two angles, front and side. Amharic is one of the most widely spoken languages in Ethiopia. We collected our own dataset of sample phrases from Amharic speakers. We preprocess the collected video data: each video is converted into frames, the frames pass through several image preprocessing stages, and finally we assemble and label the image data. The preprocessed data is then used for feature extraction and classification. For preprocessing, we use Python packages such as OpenCV (Open Source Computer Vision) and Matplotlib, and the Viola-Jones algorithm to detect the face and the lips. We use a convolutional neural network (CNN) to extract features. For classification, we use a recurrent neural network (RNN), specifically a Bidirectional Long Short-Term Memory (BiLSTM) network, and we achieved a recognition accuracy of 89%. en_US
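
The abstract's preprocessing stage (video to frames, Viola-Jones face detection, mouth cropping) could look roughly like the sketch below. This is a minimal illustration, not the thesis code: the frame count, mouth-region size, and the use of OpenCV's pretrained frontal-face Haar cascade (a Viola-Jones detector) are all assumptions; the side-angle videos mentioned in the abstract would additionally need a profile-face cascade.

```python
import cv2
import numpy as np

FRAMES_PER_CLIP = 25   # assumed fixed clip length; not stated in the abstract
LIP_SIZE = (64, 64)    # assumed mouth-ROI size (illustrative)

# OpenCV ships a pretrained Viola-Jones (Haar cascade) frontal-face detector.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def video_to_lip_frames(video_path):
    """Convert one phrase video into a stack of grayscale mouth-region frames."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while len(frames) < FRAMES_PER_CLIP:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = face_cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)
        if len(faces) == 0:
            continue  # skip frames where no face is detected
        x, y, w, h = faces[0]
        # Crude mouth ROI: the lower third of the detected face box.
        mouth = gray[y + 2 * h // 3 : y + h, x : x + w]
        frames.append(cv2.resize(mouth, LIP_SIZE) / 255.0)
    cap.release()
    return np.array(frames)[..., np.newaxis]  # shape: (frames, 64, 64, 1)
```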
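The CNN-for-features plus BiLSTM-for-classification design described in the abstract can be sketched in Keras as below. Again this is only an assumed architecture consistent with the abstract, not the author's model: the layer sizes, the number of phrase classes, and the loss function are illustrative choices.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

FRAMES_PER_CLIP = 25
LIP_SIZE = (64, 64)
NUM_PHRASES = 10  # assumed number of phrase classes in the dataset

def build_cnn_bilstm():
    # Per-frame CNN feature extractor.
    cnn = models.Sequential([
        layers.Conv2D(32, 3, activation="relu", input_shape=(*LIP_SIZE, 1)),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
    ])
    # Apply the CNN to every frame, then model the sequence with a BiLSTM.
    model = models.Sequential([
        layers.TimeDistributed(cnn, input_shape=(FRAMES_PER_CLIP, *LIP_SIZE, 1)),
        layers.Bidirectional(layers.LSTM(128)),
        layers.Dense(NUM_PHRASES, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

A clip preprocessed by the sketch above (shape `(25, 64, 64, 1)`) can be fed to this model directly; the BiLSTM reads the per-frame CNN features in both temporal directions before the softmax picks one of the phrase classes.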
dc.language.iso en_US en_US
dc.subject INFORMATION TECHNOLOGY en_US
dc.title Designing phrase-level visual speech recognition for Amharic language using Deep learning approach en_US
dc.type Thesis en_US

