Abstract:
The rapid growth of social media has transformed how cultural content is shared and consumed in Ethiopia, yet much of this content lacks interactivity, resulting in low user engagement and shallow cultural appreciation. Ethiopian cultural posts often rely on static images and text that users scroll past, missing the opportunity for deeper engagement and learning. This study addresses the problem by developing a bimodal question generation (QG) model that integrates visual and textual data to enhance user interaction with, and understanding of, Ethiopian cultural content. A dataset of 2,100 images was collected from Facebook, Telegram, and Instagram and annotated by six content creators, each with over 1,000 followers, who contributed three engagement-oriented questions per image.
This study implements and compares three architectures: ResNet-50 with LSTM, VGGNet-16 with GRU, and a transformer-based encoder-decoder. Preprocessing included image sharpening and histogram equalization, followed by data augmentation through rotation, shifting, zooming, and flipping. Performance was evaluated using BLEU-4, METEOR, ROUGE, and CIDEr, with the best model yielding scores of 0.3756, 0.5080, 0.5388, 0.4183, and 2.9023, along with a human evaluation score of 3.2. The transformer-based model outperformed the other architectures, capturing visual-textual relationships more effectively and improving interaction with cultural content. Future work should expand the dataset and explore richer multimodal approaches to further strengthen model robustness.
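For illustration only, the preprocessing and augmentation steps named above could be realized with a standard OpenCV/Keras stack as sketched below; the sharpening kernel, the luminance-channel equalization, and all augmentation ranges are assumptions for the sketch, not the study's reported configuration.

```python
# Illustrative sketch of the described preprocessing and augmentation
# (assumed OpenCV + Keras stack; parameter values are hypothetical).
import cv2
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

def preprocess(path: str) -> np.ndarray:
    img = cv2.imread(path)  # BGR uint8 image
    # Image sharpening with a common 3x3 high-boost kernel
    kernel = np.array([[0, -1, 0],
                       [-1, 5, -1],
                       [0, -1, 0]])
    img = cv2.filter2D(img, -1, kernel)
    # Histogram equalization on the luminance channel only,
    # preserving the colour balance of the cultural imagery
    ycrcb = cv2.cvtColor(img, cv2.COLOR_BGR2YCrCb)
    ycrcb[:, :, 0] = cv2.equalizeHist(ycrcb[:, :, 0])
    return cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR)

# Augmentation via rotation, shifting, zooming, and flipping,
# matching the techniques listed in the abstract
augmenter = ImageDataGenerator(
    rotation_range=15,       # degrees; illustrative value
    width_shift_range=0.1,   # fraction of image width
    height_shift_range=0.1,  # fraction of image height
    zoom_range=0.1,
    horizontal_flip=True,
)

img = preprocess("example.jpg")  # hypothetical input file
batch = np.expand_dims(img, axis=0).astype("float32")
augmented = next(augmenter.flow(batch, batch_size=1))[0]
```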
Keywords: Ethiopian Culture, Social Media, Bimodal Question Generation, Visual Question Generation (VQG)