Abstract:
Speech recognition, also known as automatic speech recognition (ASR), is a technology
that enables software to transcribe spoken language into text. Traditional ASR methods, however, require multiple separately built components, such as acoustic, pronunciation, and language models with hand-crafted dictionaries, which are time-consuming to construct and can limit performance. This study proposes replacing much of that pipeline with a single end-to-end neural architecture: a hybrid model that combines a convolutional neural network (CNN) with a recurrent neural network (RNN), trained with a connectionist temporal classification (CTC) loss function. We
perform three main experiments on different datasets: one with clean audio comprising 576,656 valid sentences, another with noisy audio containing 20,000 valid sentences, and a third that combines both, for a total of 596,656 valid sentences. The system was evaluated using the word error rate (WER) metric,
achieving 2% WER on the clean data, 7% WER on the noisy data, and 5% WER on the combined data. This approach has significant implications for speech recognition: it reduces the human effort required to create pronunciation dictionaries and improves the efficiency and accuracy of ASR systems, making them more practical for real-world applications. For future work, we suggest including dialectal and spontaneous speech in the dataset; fine-tuning the model on specific tasks or domains could further tailor its performance and improve its effectiveness in those areas.
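
To make the hybrid CNN-RNN-CTC architecture concrete, the following is a minimal, hypothetical PyTorch sketch of such a model. All layer sizes, the 29-character alphabet, the mel-spectrogram input shape, and the names introduced here (CNNRNNCTC, n_mels, hidden, n_chars) are illustrative assumptions, not the configuration reported in this study.

```python
import torch
import torch.nn as nn

class CNNRNNCTC(nn.Module):
    """Sketch of a hybrid CNN + bidirectional RNN acoustic model trained
    with CTC. All sizes are illustrative assumptions, not the paper's."""

    def __init__(self, n_mels=80, hidden=256, n_chars=29):
        super().__init__()
        # CNN front end: local spectro-temporal feature extraction
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            # stride (2, 1): halve the frequency axis, keep the time axis
            nn.Conv2d(32, 32, kernel_size=3, stride=(2, 1), padding=1),
            nn.ReLU(),
        )
        # Bidirectional GRU models long-range temporal context
        self.rnn = nn.GRU(32 * (n_mels // 2), hidden, num_layers=2,
                          bidirectional=True, batch_first=True)
        # Per-frame log-probabilities over characters plus the CTC blank
        self.fc = nn.Linear(2 * hidden, n_chars + 1)

    def forward(self, x):              # x: (batch, 1, n_mels, time)
        f = self.conv(x)               # (batch, 32, n_mels // 2, time)
        b, c, m, t = f.shape
        f = f.permute(0, 3, 1, 2).reshape(b, t, c * m)
        out, _ = self.rnn(f)           # (batch, time, 2 * hidden)
        return self.fc(out).log_softmax(dim=-1)

# One illustrative training step; nn.CTCLoss expects (time, batch, classes)
model = CNNRNNCTC()
ctc = nn.CTCLoss(blank=29, zero_infinity=True)
spec = torch.randn(4, 1, 80, 200)          # batch of log-mel spectrograms
log_probs = model(spec).permute(1, 0, 2)   # (time=200, batch=4, classes=30)
targets = torch.randint(0, 29, (4, 30))    # character indices (0..28)
input_lengths = torch.full((4,), 200, dtype=torch.long)
target_lengths = torch.full((4,), 30, dtype=torch.long)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```

At inference time, a greedy CTC decode (take the per-frame argmax, collapse repeated characters, drop blanks) or a beam search converts the frame-level log-probabilities into a character sequence, which is why no separate pronunciation dictionary is needed.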