Abstract:
Malaria is a global health challenge and remains one of the leading communicable causes of
death worldwide. Current malaria monitoring often relies on basic surveillance systems, which
are not efficient at capturing the factors that influence malaria outbreak occurrence in Ethiopia.
Traditional statistical models are slow or limited in scope, which leads to delayed responses, and
they are not flexible in handling complex relationships and non-linearity in data. In addition,
many machine learning models operate as black boxes, making it challenging to understand their
decision-making process. Early detection of epidemics based on existing technology is crucial for
effective disease control and prevention strategies. This study aims to develop an explainable
machine learning model for early detection of malaria outbreaks. The datasets used to build
the model were collected from the Amhara Regional Health Bureau and the Amhara Public Health
Institute in Ethiopia. The prediction model was developed to predict whether or not an outbreak
has occurred based on the information in the dataset. In this study, machine learning algorithms were utilized to
develop the model. Multiple models, including Logistic Regression, Decision Tree, K-Nearest
Neighbors, Artificial Neural Network, Random Forest, and Extreme Gradient Boosting (XGBoost), were
trained to predict the occurrence of malaria outbreaks using the collected dataset. SMOTE was
applied to address class imbalance. Cross-validation was utilized to reduce overfitting by splitting
the data into multiple subsets and iteratively training the model on all but one subset, which was
held out for validation. Hyper-parameters were
optimized using Bayesian, Grid Search, and Random Search techniques. Among these techniques,
Grid Search yielded the best combinations of hyper-parameters compared to Random Search and
Bayesian methods. The performance of the prediction model was evaluated with metrics
such as accuracy, precision, recall, F1-score, and the area under the ROC curve (AUC). The results of the experiments
indicate that XGBoost, achieving an accuracy of 0.98 and an AUC value of 0.99 after SMOTE,
outperformed other machine learning techniques. The combination of gradient boosting,
regularization, feature importance, efficiency, and flexibility makes XGBoost a powerful choice for
machine learning tasks. Model explainability techniques such as LIME and SHAP were employed
in this study to make the model more understandable. The study has significant potential to save
lives, optimize resources, strengthen healthcare systems, and contribute to global health goals.
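
The cross-validation and evaluation steps named above can be illustrated with a minimal sketch. It is not the study's implementation: the function names k_fold_indices and binary_metrics are hypothetical, the code uses only the Python standard library, and the labels in any usage would be toy values rather than the Amhara dataset. It assumes plain (unshuffled, unstratified) k-fold splitting and a binary outbreak label where 1 means an outbreak occurred.

```python
from typing import Dict, List, Sequence, Tuple


def k_fold_indices(n_samples: int, k: int) -> List[Tuple[List[int], List[int]]]:
    """Return k (train, validation) index pairs.

    Each fold is held out once for validation while the remaining
    k - 1 folds form the training set. (Hypothetical helper.)
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread the remainder so fold sizes differ by at most one.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds: List[List[int]] = []
    start = 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    splits = []
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, validation))
    return splits


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, precision, recall and F1 for binary labels (1 = outbreak)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

In a setup like this, any oversampling step such as SMOTE would normally be fitted on the training indices of each split only, so that synthetic samples do not leak into the held-out fold. Whether the study followed that ordering is not stated here.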
Keywords: Explainable ML techniques, Hyperparameter Optimization, Malaria Outbreaks,
SMOTE,