AMHARIC TEXT HATE SPEECH DETECTION IN SOCIAL MEDIA USING DEEP LEARNING APPROACH

EMUYE, BAWOKE

URI: http://ir.bdu.edu.et/handle/123456789/12723

Date: 2020-07

Abstract:

The connectivity and accessibility of social media platforms allow people around the world to express their ideas and share experiences easily. However, the anonymity and flexibility afforded by the Internet have made it easy for users to communicate aggressively. Hate speech affects society in many ways: it harms the mental health of targeted audiences, damages social interaction, and leads to violence and destruction of property. Determining whether a text contains hate speech is a difficult task for humans: it is time-consuming, tedious, and introduces subjective notions of what makes a text hate or offensive speech. To address this problem, this research develops a detection model for Amharic hate speech texts using deep learning approaches. We prepare a new Amharic hate speech dataset from Facebook and Twitter, labeled into four classes, and then augment the data to balance the classes. Word2vec embeddings and word embeddings learned by the Keras embedding layer are used as features for the deep learning models. CNN, LSTM, Bi-LSTM, GRU, and combined CNN-LSTM models are trained on the whole dataset with each of the two features, for both the augmented and original datasets. We evaluate the models using an 80/20 train-test split and compare them using precision, recall, and F1-score. In total, the study developed five different models with each feature on each of the two datasets (original and augmented). The Bi-LSTM model with Word2vec achieves slightly better performance than the other models on both the augmented and original datasets. According to the classification results, the model trained with augmented data shows slightly less confusion between the offensive class and the both (hate and offensive) class than the model trained without augmentation. However, the models mostly tend to misclassify hate speech as both (hate and offensive) speech. Overall, Bi-LSTM achieves the highest F1-score (90%), followed by the CNN classifier with an F1-score of 89%.
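
The evaluation protocol described in the abstract (an 80/20 train-test split scored with precision, recall, and F1-score) can be illustrated with a minimal sketch in plain Python. It is not the thesis's code. The function names, the fixed random seed, and the macro-averaging step are assumptions made for illustration. The abstract names only three of the four classes (hate, offensive, both), so the fourth label, "neutral", is hypothetical, and the toy labels in the usage lines are invented, not taken from the dataset.

```python
import random
from collections import Counter

# Hypothetical label names: the abstract names hate, offensive, and both;
# the fourth class ("neutral") is an assumption made for this sketch.
LABELS = ["hate", "offensive", "both", "neutral"]


def train_test_split_80_20(samples, seed=0):
    """Shuffle the samples and return (train, test) in an 80/20 ratio."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 4 // 5  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


def per_class_scores(y_true, y_pred, labels=LABELS):
    """Return {label: (precision, recall, f1)} for two parallel label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = {}
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        f_den = precision + recall
        f1 = 2 * precision * recall / f_den if f_den else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_f1(scores):
    """Unweighted mean of the per-class F1 values."""
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


# Toy usage: one "hate" example is predicted as "both", the kind of
# confusion the abstract reports. These labels are invented.
y_true = ["hate", "hate", "both", "offensive", "neutral"]
y_pred = ["both", "hate", "both", "offensive", "neutral"]
scores = per_class_scores(y_true, y_pred)
train, test = train_test_split_80_20(range(10))
```

In this toy case the single hate-to-both error lowers recall for the hate class and precision for the both class, which is how such confusion would show up in per-class scores.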
