Abstract:
The widespread availability of portable and mobile phone cameras has led
to a significant increase in the number of captured images. However, effectively
describing the information contained in these images remains a major challenge.
Image caption generation, a complex task in computer vision and natural language
processing, aims to automatically generate accurate descriptions of image content.
Unfortunately, many existing image captioning approaches rely on simplistic
feature extraction that does not fully exploit object detection or color
information, leading to inaccurate or incomplete descriptions. Holy pictures,
which hold special religious significance, have received surprisingly little
attention in image captioning research. To address this gap, we propose the
Holy Pictures Amharic Caption Generation system, which leverages object
detection, color features, and attention-based caption decoding.
Our approach involved collecting 2,300 holy pictures from various sources and
manually localizing objects and preparing captions and object labels for the
2,070 images in the training and validation sets. To enhance image quality, we applied
contrast-limited adaptive histogram equalization (CLAHE), YCrCb color space
conversion, and bilateral filtering for noise removal.
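As an illustration, a minimal sketch of this preprocessing stage, assuming an OpenCV implementation in which CLAHE is applied to the luminance channel in YCrCb space; the clip limit, tile size, and bilateral filter parameters are illustrative rather than values from our experiments:

```python
import cv2

def enhance(image_bgr):
    # Work in YCrCb so contrast enhancement touches only the luminance channel.
    y, cr, cb = cv2.split(cv2.cvtColor(image_bgr, cv2.COLOR_BGR2YCrCb))

    # Contrast-limited adaptive histogram equalization (CLAHE) on luminance.
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    y = clahe.apply(y)

    enhanced = cv2.cvtColor(cv2.merge((y, cr, cb)), cv2.COLOR_YCrCb2BGR)

    # Edge-preserving bilateral filter for noise removal.
    return cv2.bilateralFilter(enhanced, d=9, sigmaColor=75, sigmaSpace=75)
```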
For the caption generation process, we utilized the SSD (Single Shot MultiBox
Detector) model for accurate object localization and color feature extraction.
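Conceptually, each SSD detection can be paired with a color descriptor computed over its bounding box. The sketch below assumes a hypothetical ssd_detect wrapper around a pretrained SSD and uses the mean YCrCb value of each region as a stand-in color feature:

```python
import cv2

def region_color_features(image_bgr, detections):
    """Pair each detection with a simple color descriptor: the mean
    YCrCb value over its bounding box."""
    ycrcb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2YCrCb)
    features = []
    for label, (x1, y1, x2, y2) in detections:
        region = ycrcb[y1:y2, x1:x2].reshape(-1, 3)
        features.append((label, region.mean(axis=0)))  # (Y, Cr, Cb) means
    return features

# Usage with the hypothetical SSD wrapper:
# detections = ssd_detect(image)  # e.g. [("angel", (40, 10, 180, 200)), ...]
# color_feats = region_color_features(image, detections)
```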
The input images were encoded with the XceptionV3 architecture to produce image
features, which were then decoded by an LSTM with an attention mechanism to
generate descriptive captions.
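A compact sketch of such an encoder-attention-decoder in tf.keras; here the abstract's XceptionV3 is assumed to refer to the Keras Xception backbone, and all layer sizes are illustrative:

```python
import tensorflow as tf
from tensorflow.keras import layers

class BahdanauAttention(layers.Layer):
    """Additive attention over the encoder's spatial feature grid."""
    def __init__(self, units):
        super().__init__()
        self.W1 = layers.Dense(units)
        self.W2 = layers.Dense(units)
        self.V = layers.Dense(1)

    def call(self, features, hidden):
        # features: (batch, regions, depth); hidden: decoder LSTM state.
        scores = self.V(tf.nn.tanh(
            self.W1(features) + self.W2(tf.expand_dims(hidden, 1))))
        weights = tf.nn.softmax(scores, axis=1)      # one weight per region
        return tf.reduce_sum(weights * features, 1)  # context vector

# Encoder: pretrained CNN backbone without its classifier head; a 299x299
# input yields a 10x10x2048 grid, reshaped to 100 region vectors that the
# LSTM decoder attends over before emitting each word of the caption.
encoder = tf.keras.applications.Xception(
    include_top=False, weights="imagenet", input_shape=(299, 299, 3))
```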
Our experimental results demonstrated superior performance, achieving scores of
0.84, 0.88, 0.89, and 0.81 on the BLEU-4, CIDEr, SPICE, and HEM metrics,
respectively. These results
highlight the effectiveness of leveraging advanced image processing techniques, color
values, and object detection in improving the accuracy and detail of image captions.
However, a key limitation is the small dataset, which restricts generalization.
Future research should focus on expanding the dataset to improve the model's
applicability and performance.
Keywords: image caption; color space; LSTM; SSD; object detection; color
feature; attention