dc.description.abstract |
Sign language is a vital means of communication for Deaf communities, encompassing complex
gestures and facial expressions to convey meaning. Recognizing sign language through
automated systems presents a significant challenge due to its dynamic nature and the variability
among individual signers. For effective communication, Ethiopian Sign Language (EthSL)
relies on the accurate recognition of continuous sequences of signs. Previous research efforts
have primarily concentrated on recognizing isolated signs at the character, word, and phrase levels.
While these studies have been valuable, they fall short of capturing the full complexity and
fluidity of natural sign language conversations, which occur at the continuous (sentence) level.
Previous researchers have used traditional methods for EthSL recognition, typically decomposing
the task into separate stages. The primary problem addressed in this study is the recognition
of continuous EthSL using an end-to-end approach, which involves understanding sequences of signs
performed by multiple signers. This task is crucial for creating practical sign language recognition
systems for real-world use, such as translation services and communication aids for the
Deaf community.
To address the challenge of continuous EthSL recognition, we created the first-ever dataset for
Continuous (Sentence-Level) EthSL. This dataset features recordings of 30 unique sentences
performed by 22 different signers. Each signer performed each sentence twice, resulting in a
total of 1,320 sentences. Our methodology uses an end-to-end approach that employs a 2D convolution
to extract spatial features from individual video frames, followed by a 1D convolution to
capture short-term motion details. These spatial and temporal features are then processed using
a Bidirectional Long Short-Term Memory (BiLSTM) network to recognize long-term temporal
patterns and the sequential nature of sign language. Finally, a classifier with Connectionist
Temporal Classification (CTC) is used to align and recognize the continuous sign language sequences.
This integrated approach effectively harnesses both spatial and temporal information
to improve recognition accuracy.
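As a minimal illustrative sketch of this pipeline (not the thesis implementation; the framework, layer sizes, and vocabulary size below are assumptions chosen for readability), the described 2D-CNN, 1D-CNN, BiLSTM, and CTC stages could be expressed in PyTorch roughly as follows:

import torch
import torch.nn as nn

class ContinuousSLR(nn.Module):
    def __init__(self, vocab_size=100, hidden=256):  # illustrative sizes, not the thesis values
        super().__init__()
        # 2D convolutions: spatial features from each video frame
        self.spatial = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # 1D convolution over time: short-term motion details
        self.temporal = nn.Conv1d(64, hidden, kernel_size=5, padding=2)
        # BiLSTM: long-term temporal patterns across the sign sequence
        self.bilstm = nn.LSTM(hidden, hidden, bidirectional=True, batch_first=True)
        # Classifier over the gloss vocabulary plus the CTC blank token
        self.fc = nn.Linear(2 * hidden, vocab_size + 1)

    def forward(self, x):                        # x: (batch, time, 3, H, W)
        b, t = x.shape[:2]
        f = self.spatial(x.flatten(0, 1))        # (b*t, 64, 1, 1)
        f = f.reshape(b, t, -1).transpose(1, 2)  # (b, 64, t)
        f = torch.relu(self.temporal(f)).transpose(1, 2)  # (b, t, hidden)
        f, _ = self.bilstm(f)                    # (b, t, 2*hidden)
        return self.fc(f).log_softmax(-1)        # frame-wise gloss log-probabilities

# Training would align the frame-wise outputs with the target gloss sequences via
# nn.CTCLoss, which marginalizes over all monotonic alignments, so no frame-level
# annotation of the continuous videos is required.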
The system was evaluated on test splits designed for signer independence and for unseen
sentences, achieving Word Error Rates (WER) of 8.02% and 47.02%, respectively. These results
highlight the method's robustness to variability among unseen signers, demonstrating its
strength in signer-independent recognition. The 47.02% WER on the unseen-sentence split is a
notable result, but it also indicates a substantial area for improvement.
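(For reference, WER is the standard metric for continuous recognition: it counts the minimum
number of substitutions S, deletions D, and insertions I needed to turn the recognized gloss
sequence into a reference sequence of N glosses, i.e. WER = (S + D + I) / N.)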
To address this, future research should focus on expanding the dataset to include a wider
variety of sentences, potentially through collaborative data collection efforts. Overall, the
study presents a significant advancement in continuous EthSL recognition, demonstrating a
robust approach to signer independence while highlighting the need for further work to improve
performance on unseen sentences.
Keywords: Ethiopian Sign Language; Sign Language Recognition; Deep Learning; Continuous SLR |
en_US |