Abstract:
Topic clustering is one of the methods used to organize comments posted on Online Media Service
(OMS) news. Online news, such as news shared on social media, receives many comments every day. However,
most comments are not organized well enough to make relevant information on a specific topic easy to find.
Topic clustering for short text documents is a very challenging task, especially for comments that
are very concise and contain few words per document. In addition, short text suffers from
data sparsity and irregularity, and most words appear only once in a short text. To the best of
our knowledge, no prior work addresses the clustering of short Amharic comments. To address the
aforementioned problems, we developed an Amharic comment topic clustering model using
contextual sentence representation and partition-based algorithms. This thesis aims to design and
develop a topic clustering model for Amharic comments on OMS news. We used BERT
(Bidirectional Encoder Representations from Transformers) models for contextual sentence
representation. Transfer learning is used to produce sentence embeddings of Amharic
comments from English BERT. Finally, we applied the mini-batch k-means and fuzzy c-means
clustering algorithms. We conducted experiments on the two models, and the experimental results
show that BERT embedding with the mini-batch k-means clustering algorithm and BERT embedding with fuzzy
c-means clustering both achieve a value of 1.0 for the V-measure score, adjusted Rand score, and
adjusted mutual information score. However, fuzzy c-means has a slightly lower silhouette score,
0.996, than mini-batch k-means, which scores 0.998. Mini-batch k-means clustering
is therefore slightly more accurate by this measure and takes less time to compute. Fuzzy c-means clustering
produces comparable results but takes longer to compute. Therefore,
the mini-batch k-means clustering algorithm was found to be more appropriate for clustering Amharic
comments on news.
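
As an illustration only, and not the implementation used in this thesis, the mini-batch k-means step summarized above can be sketched in plain Python. The two-dimensional toy vectors stand in for BERT sentence embeddings, and the function names, parameters (`init`, `batch_size`, `n_iter`, `seed`), and data are all assumptions made for this example.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centers):
    """Index of the center closest to `point`."""
    return min(range(len(centers)), key=lambda j: sq_dist(point, centers[j]))

def mini_batch_kmeans(points, k, init=None, batch_size=4, n_iter=50, seed=0):
    """Hypothetical minimal mini-batch k-means; returns (centers, labels)."""
    rng = random.Random(seed)
    # Initial centers: caller-supplied, otherwise k random data points.
    start = init if init is not None else rng.sample(points, k)
    centers = [list(c) for c in start]
    counts = [0] * k  # number of points each center has absorbed so far
    for _ in range(n_iter):
        batch = rng.sample(points, batch_size)
        # Assign the whole batch first, using the centers as they stand.
        assigned = [nearest(p, centers) for p in batch]
        for p, j in zip(batch, assigned):
            counts[j] += 1
            eta = 1.0 / counts[j]  # per-center learning rate
            # Move the center toward the point (running mean of its points).
            centers[j] = [(1 - eta) * c + eta * x for c, x in zip(centers[j], p)]
    labels = [nearest(p, centers) for p in points]
    return centers, labels

# Toy stand-ins for sentence embeddings: two well-separated groups.
toy = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (0.3, 0.2),
       (9.8, 10.0), (10.1, 9.9), (10.0, 10.2), (9.9, 10.1)]
centers, labels = mini_batch_kmeans(toy, k=2, init=[toy[0], toy[-1]])
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1]
```

Each iteration touches only a small random batch rather than the full data set, which is why this family of algorithms tends to be cheaper to compute than methods that update every membership on every pass, such as fuzzy c-means.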