Abstract:
Visual Question Answering (VQA) is a vision-to-text (V2T) task that integrates the visual features of an image with a natural language question to generate a meaningful answer. Most existing research has focused on English, leaving a significant gap for other languages, including Amharic. Tourism, a major global industry, relies heavily on interactions in which visitors seek information about natural, historical, cultural, and religious sites. Ethiopia is a remarkable tourist destination, home to unique heritage sites, and most of its visitors are local, creating an urgent need for a VQA model that can deliver accurate, culturally relevant information in Amharic. No such model currently exists to assist tourists at these heritage sites. This research addresses that gap by developing an Amharic VQA model tailored to Ethiopian tourism. A new Amharic VQA dataset was created from 2,200 diverse images of Ethiopian tourist sites paired with 6,600 questions in Amharic. The images were collected from various sources, including the UNESCO website, the Amhara Tourism office, and online platforms such as Facebook, Free pixel, and Instagram. Each image is accompanied by three questions formulated by three individual experts and answered by ten candidates; the questions, answers, and images are linked through annotations and fed into the model. We used ResNet-50 for visual feature extraction and a Bidirectional Gated Recurrent Unit (BiGRU) with an attention mechanism for question encoding, achieving a test accuracy of 54.98% and demonstrating the model's effectiveness in answering questions about Ethiopian heritage. In future work, we plan to incorporate external knowledge to generate answers and descriptions beyond image content, and to add custom object detection.
Keywords: Amharic language; Ethiopian tourism; Deep learning
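The fusion of ResNet-50 image features with a BiGRU question encoding via attention can be sketched as follows. This is a minimal illustrative sketch in NumPy, not the paper's actual implementation: all dimensions, projection matrices, and the additive-attention form are assumptions (the abstract does not specify them).

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Assumption: ResNet-50 yields a 7x7 grid of 2048-d region features.
regions = rng.standard_normal((49, 2048))

# Assumption: the BiGRU encoder summarizes the Amharic question as a
# 1024-d vector (512-d forward + 512-d backward final hidden states).
question = rng.standard_normal(1024)

# Hypothetical learned projections into a shared 512-d attention space
# (random here, only to make the sketch runnable).
W_img = rng.standard_normal((2048, 512)) * 0.01
W_q = rng.standard_normal((1024, 512)) * 0.01
w_att = rng.standard_normal(512) * 0.01

# Additive attention: score each image region against the question,
# normalize the scores, and pool the regions by their weights.
scores = np.tanh(regions @ W_img + question @ W_q) @ w_att  # (49,)
alpha = softmax(scores)          # attention weights, sum to 1
attended = alpha @ regions       # (2048,) question-guided image vector
```

The attended vector would then be combined with the question encoding and passed to an answer classifier; the exact fusion and classifier head are left unspecified here.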