
Designing phrase-level visual speech recognition for Amharic language using Deep learning approach


dc.contributor.author Dessalew, Getnet
dc.date.accessioned 2021-09-21T12:36:40Z
dc.date.available 2021-09-21T12:36:40Z
dc.date.issued 2021-07
dc.identifier.uri http://ir.bdu.edu.et/handle/123456789/12623
dc.description.abstract Visual speech recognition, or lip reading, is the process of understanding a speaker's speech by looking at the movement of the speaker's mouth. Human beings struggle to understand what a speaker is saying in many situations: when video data is corrupted and its sound (audio data) has been deleted or distorted, intentionally or unintentionally; when video is captured by surveillance cameras (security camera videos are usually captured from a distance and often have no audio, or the audio is unusable); and when hearing-impaired people, who cannot hear the voice, must follow the speech. In all such cases, the speech must be understood by watching the movement of the speaker's lips. Understanding speech from lip movement alone is a very difficult task for human beings; research shows that even a trained person can recognize only about 20% of a speech by watching the speaker's lips. Since visual speech recognition is so difficult for humans, it should be automated. Researchers around the world have studied automatic visual speech recognition for different languages. Previous visual speech recognition systems for Amharic were proposed at the word and digit level, and they considered video taken only from the front. In this study, we propose phrase-level visual speech recognition for the Amharic language using a deep learning approach, and we consider video captured from two angles, front and side. Amharic is one of the most widely spoken languages in Ethiopia. We collected our own dataset of sample phrases from Amharic speakers. We preprocess the collected video data: each video is converted into frames, the frames pass through several image preprocessing stages, and finally we assemble and label the image data. The preprocessed data is then used for feature extraction and classification. For preprocessing, we use Python packages such as OpenCV (Open Source Computer Vision) and Matplotlib, and the Viola-Jones algorithm to detect the face and the lips. We use a convolutional neural network (CNN) to extract features. For classification, we use a recurrent neural network (RNN), specifically a Bidirectional Long Short-Term Memory (BiLSTM) network, and we achieved a recognition accuracy of 89%. en_US
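
The abstract's preprocessing stage (video to frames, Viola-Jones face detection, mouth cropping) could look roughly like the sketch below. This is a minimal illustration, not the thesis code: the frame count, mouth-region size, and the use of OpenCV's pretrained frontal-face Haar cascade (a Viola-Jones detector) are all assumptions; the side-angle videos mentioned in the abstract would additionally need a profile-face cascade.

```python
import cv2
import numpy as np

FRAMES_PER_CLIP = 25   # assumed fixed clip length; not stated in the abstract
LIP_SIZE = (64, 64)    # assumed mouth-ROI size (illustrative)

# OpenCV ships a pretrained Viola-Jones (Haar cascade) frontal-face detector.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def video_to_lip_frames(video_path):
    """Convert one phrase video into a stack of grayscale mouth-region frames."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while len(frames) < FRAMES_PER_CLIP:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = face_cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)
        if len(faces) == 0:
            continue  # skip frames where no face is detected
        x, y, w, h = faces[0]
        # Crude mouth ROI: the lower third of the detected face box.
        mouth = gray[y + 2 * h // 3 : y + h, x : x + w]
        frames.append(cv2.resize(mouth, LIP_SIZE) / 255.0)
    cap.release()
    return np.array(frames)[..., np.newaxis]  # shape: (frames, 64, 64, 1)
```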
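The CNN-for-features plus BiLSTM-for-classification design described in the abstract can be sketched in Keras as below. Again this is only an assumed architecture consistent with the abstract, not the author's model: the layer sizes, the number of phrase classes, and the loss function are illustrative choices.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

FRAMES_PER_CLIP = 25
LIP_SIZE = (64, 64)
NUM_PHRASES = 10  # assumed number of phrase classes in the dataset

def build_cnn_bilstm():
    # Per-frame CNN feature extractor.
    cnn = models.Sequential([
        layers.Conv2D(32, 3, activation="relu", input_shape=(*LIP_SIZE, 1)),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
    ])
    # Apply the CNN to every frame, then model the sequence with a BiLSTM.
    model = models.Sequential([
        layers.TimeDistributed(cnn, input_shape=(FRAMES_PER_CLIP, *LIP_SIZE, 1)),
        layers.Bidirectional(layers.LSTM(128)),
        layers.Dense(NUM_PHRASES, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

A clip preprocessed by the sketch above (shape `(25, 64, 64, 1)`) can be fed to this model directly; the BiLSTM reads the per-frame CNN features in both temporal directions before the softmax picks one of the phrase classes.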
dc.language.iso en_US en_US
dc.subject INFORMATION TECHNOLOGY en_US
dc.title Designing phrase-level visual speech recognition for Amharic language using Deep learning approach en_US
dc.type Thesis en_US

