<strong>Paper Title</strong><br>

Predictive Modeling of Diabetes Using Machine Learning<br>

<br>


<strong>Abstract</strong><br>

Diabetes has become a global epidemic, necessitating innovative approaches for early detection and management. This research applies machine learning techniques to predict diabetes using two distinct datasets: a focused dataset of females of Pima Indian heritage and a comprehensive dataset covering diverse medical and demographic information. Our methodology involves data cleaning, balancing the outcome classes with SMOTE (Synthetic Minority Over-sampling Technique), and extensive data visualization. Eight machine learning models are assessed, including Logistic Regression, SVC, AdaBoost Classifier, K-Neighbors Classifier, and Gaussian Naïve Bayes. Models are selected by cross-validation, and accuracy, precision, recall, and F1 score are calculated for each model. Random Forest and Gradient Boosting emerged as the most effective models for the Pima dataset and the comprehensive dataset, respectively. Confusion matrices were plotted to examine the performance of these classification models. In the Pima dataset, unexpected findings challenge the conventional correlation between age and diabetes, while the second dataset reinforces established patterns. The study emphasizes population-specific considerations in diabetes prediction models and advocates for tailored approaches. Combining diverse datasets enhances the robustness of our models, paving the way for accurate and personalized diabetes prediction.

Keywords - Diabetes, Machine Learning, Confusion Matrix, Accuracy
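
The abstract names accuracy, precision, recall, and F1 score as the metrics derived from each model's confusion matrix. As a minimal sketch of how those four quantities relate to the confusion-matrix counts, the following pure-Python example may help. The function names (`confusion_counts`, `classification_metrics`) and the toy labels are illustrative assumptions and do not come from the study's code or data.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred, positive=1):
    """Derive accuracy, precision, recall, and F1 from the counts.

    Ratios with a zero denominator are reported as 0.0 (a convention
    chosen here for the sketch, not taken from the paper).
    """
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy labels for illustration only (1 = diabetic, 0 = non-diabetic).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
```

In a screening setting, recall on the diabetic class is often the metric of most interest, since a false negative corresponds to a missed case; accuracy alone can look strong on imbalanced data even when few positive cases are detected.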