Abstract:
Social media facilitates communication and information sharing, but it also enables the spread of hate speech. While hate speech identification from textual data has been widely studied, memes are increasingly used to spread hate speech and can bypass traditional unimodal, text-based detection models. For a low-resource language like Amharic, detecting hate speech adds a further layer of complexity. To address this issue, we create a new hate speech detection model for Amharic memes. To curate a dataset, we gather 2,007 meme samples from various social media platforms. To ensure annotation accuracy, each meme is evaluated by three separate annotators using a web-based annotation tool we developed, and a majority voting scheme determines the most reliable label for each meme. Our method uses Tesseract for text extraction, and VGG16 and Word2vec for image and text feature extraction, respectively. The model is trained on the combined features to handle multimodal inputs effectively.
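In rough outline, the feature-extraction pipeline can be sketched as follows in Python; the pretrained-model choices, the Word2vec file path, and the averaging of word vectors are illustrative assumptions, not the exact configuration used in this work.

```python
import numpy as np
import pytesseract
from PIL import Image
from gensim.models import Word2Vec
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing import image as keras_image

vgg = VGG16(weights="imagenet", include_top=False, pooling="avg")  # 512-d image vector
w2v = Word2Vec.load("amharic_word2vec.model")                      # hypothetical path

def extract_features(meme_path):
    # Image branch: VGG16 global-average-pooled features.
    img = keras_image.load_img(meme_path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(keras_image.img_to_array(img), axis=0))
    img_vec = vgg.predict(x)[0]

    # Text branch: OCR the meme with Tesseract (Amharic language pack),
    # then average the Word2vec vectors of the recognized tokens.
    text = pytesseract.image_to_string(Image.open(meme_path), lang="amh")
    tokens = [t for t in text.split() if t in w2v.wv]
    txt_vec = (np.mean([w2v.wv[t] for t in tokens], axis=0)
               if tokens else np.zeros(w2v.vector_size))

    # Fuse the two modalities by concatenation for the multimodal classifier.
    return np.concatenate([img_vec, txt_vec])
```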
We train unimodal and multimodal models using deep learning approaches such as LSTM, BiLSTM, and CNN. The results suggest that the BiLSTM model performs best on both the textual and multimodal datasets, achieving accuracies of 63% and 75%, respectively. In contrast, the CNN model performs best on the image dataset, achieving an accuracy of 69%. Comparing the unimodal and multimodal models, we find that the multimodal model detects hate speech in memes better than the unimodal models, and that the image modality contributes more to the multimodal model than the text modality. We suggest that collecting a wider variety of memes from different social media platforms could further improve the performance of the models.
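For concreteness, the following is a minimal Keras sketch of one plausible multimodal BiLSTM classifier of the kind compared above; the layer sizes, vocabulary size, and sequence length are assumed values rather than the exact architecture reported here.

```python
from tensorflow.keras import layers, models

MAX_LEN, VOCAB, EMB_DIM, IMG_DIM = 30, 20000, 100, 512  # assumed dimensions

# Text branch: Word2vec-sized embeddings fed to a bidirectional LSTM.
text_in = layers.Input(shape=(MAX_LEN,), name="token_ids")
x = layers.Embedding(VOCAB, EMB_DIM, mask_zero=True)(text_in)
x = layers.Bidirectional(layers.LSTM(64))(x)

# Image branch: precomputed VGG16 features passed through a dense layer.
img_in = layers.Input(shape=(IMG_DIM,), name="vgg16_features")
y = layers.Dense(64, activation="relu")(img_in)

# Fusion and binary hate / non-hate prediction.
z = layers.concatenate([x, y])
z = layers.Dense(32, activation="relu")(z)
out = layers.Dense(1, activation="sigmoid")(z)

model = models.Model(inputs=[text_in, img_in], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```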
Keywords: Multimodal, Hate speech, VGG16, Word2vec, Amharic