Abstract:
The aim of this research is to develop a predictive model for diabetes based on risk factors and
associated diseases using ensemble machine learning. The problem addressed in this research
is how to improve public health by enabling appropriate action. The research emphasizes the need
for timely detection and prediction of diabetes to prevent complications and improve public
health. The study followed an experimental research design. The data were obtained from the
CDC and were collected through the Behavioral Risk Factor Surveillance System (BRFSS). The
original dataset contains 253,680 instances and is class-imbalanced. After data preprocessing and
class balancing by random undersampling of the majority class, 70,692 instances were used for
modeling. The number of features was reduced from the original 21 to 18 using a wrapper feature
selection method (recursive feature elimination). To construct the best model, six experiments
were conducted by splitting the dataset into training, validation, and test sets with a ratio of
80%, 10%, and 10%, respectively, using the Random Forest, CatBoost, bagging decision tree,
AdaBoost, XGBoost, and Extra Trees algorithms. The performance of the models was evaluated
using several evaluation metrics: precision, recall, accuracy, F1 score, AUC, and the
confusion matrix. The overall accuracies of Random Forest, CatBoost, bagging decision tree,
AdaBoost, XGBoost, and Extra Trees were 90.16%, 88.94%, 88.97%, 87.87%, 88.81%, and 89.86%,
respectively. Random Forest was the best predictive model among those compared, with an accuracy
of 90.16% and an ROC AUC of 96%. Local Interpretable Model-agnostic Explanations (LIME) were
used to explain and interpret how the machine learning model arrives at its individual
predictions.
Keywords: diabetes, risk factors, associated diseases, LIME, ensemble machine learning, predictive
model.
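
The class balancing and data splitting steps described in the abstract can be illustrated with a
minimal sketch. It uses only the Python standard library and a small synthetic dataset, not the
BRFSS data. The function names (random_undersample, split_80_10_10), the fixed seed, and the
choice of a plain (non-stratified) random split are assumptions made for illustration; the
abstract does not specify how the study implemented these steps.

```python
import random
from collections import defaultdict


def random_undersample(rows, label_of, seed=42):
    """Randomly downsample every class to the size of the smallest class.

    Hypothetical helper: `label_of` maps a row to its class label.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)
    n_min = min(len(group) for group in by_class.values())
    balanced = []
    for label in sorted(by_class):
        # Sample without replacement so no instance is duplicated.
        balanced.extend(rng.sample(by_class[label], n_min))
    rng.shuffle(balanced)
    return balanced


def split_80_10_10(rows, seed=42):
    """Shuffle and split rows into train/validation/test at 80%/10%/10%."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_val = len(shuffled) // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Synthetic, imbalanced toy data: (instance id, label), 900 negatives, 100 positives.
rows = [(i, 0) for i in range(900)] + [(i, 1) for i in range(900, 1000)]

balanced = random_undersample(rows, label_of=lambda r: r[1])
train, val, test = split_80_10_10(balanced)

print(len(balanced), len(train), len(val), len(test))
```

In this sketch the majority class is reduced to the size of the minority class before splitting,
which matches the order of steps given in the abstract. Feature selection and model training
would then be fitted on the training set only, with the validation set used for model comparison
and the test set held out for the final evaluation.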