
Machine Learning for Diabetes Risk Prediction

Classification and population segmentation on the CDC BRFSS 2015 survey (253,680 respondents) — individual risk screening via Logistic Regression, and unsupervised health-profile discovery via K-means clustering.

Logistic Regression · Random Forest · K-means · RFECV · SMOTE · Class Weighting · Threshold Optimisation · Isotonic Calibration · PCA · SHAP · scikit-learn · pandas · matplotlib
Fig 16. Cluster health profiles (standardised means heatmap). Cluster 2 — High Clinical Burden — shows consistently elevated risk across every dimension and the highest diabetes prevalence. The three clusters map directly onto actionable intervention tiers: preventive, lifestyle-change, and clinical management.

Abstract

The CDC BRFSS 2015 survey records 253,680 respondents across 21 health, lifestyle, and demographic features. The diabetes label is heavily imbalanced, which motivates cost-sensitive evaluation and threshold optimisation throughout. This project pursues two objectives under the CRISP-DM framework: individual-level risk prediction through supervised classification, and population-level segmentation through unsupervised clustering.

The final Logistic Regression model achieves ROC-AUC 0.818 and at-risk recall 0.761 at an operating threshold of t = 0.48. The model was selected not for raw accuracy but for the flatness of its cost curve, which makes deployment behaviour robust to threshold variation. SHAP confirms GenHlth, Age, and BMI as the three dominant drivers. K-means (k = 3) recovers three clinically differentiated population strata, validated externally by a chi-square test against diabetes labels.

Key Findings

0.818 ROC-AUC: final LR, consistent between cross-validation (0.8177) and the test set (0.8176)
0.761 at-risk recall at t = 0.48: fewer false negatives than RF at equivalent precision
k = 3: silhouette score stable across PCA(6) and PCA(10); selected for robustness, not peak score
Top 3 SHAP drivers: GenHlth, Age, BMI, consistent with clinical literature on T2D risk factors
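
The SHAP ranking above can be illustrated for a linear model without the `shap` package. For logistic regression with features treated as independent, the SHAP value of feature j for one respondent is the coefficient times the feature's deviation from its mean, in log-odds units. The sketch below is not the project's code: the data are synthetic, the weights in `true_w` are hypothetical, and only the feature names are borrowed from the text.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
feature_names = ["GenHlth", "Age", "BMI", "HighBP"]

# Synthetic standardised features with hypothetical effect sizes (assumption).
n = 2000
X = rng.normal(size=(n, 4))
true_w = np.array([1.0, 0.8, 0.6, 0.3])
logits = X @ true_w - 2.0
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

model = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
w = model.coef_[0]

# Linear SHAP under feature independence, in log-odds units:
# phi[i, j] = w[j] * (X[i, j] - mean of feature j)
phi = w * (X - X.mean(axis=0))
base_value = model.intercept_[0] + w @ X.mean(axis=0)

# Global importance: mean absolute SHAP value per feature.
mean_abs = np.abs(phi).mean(axis=0)
ranking = [feature_names[j] for j in np.argsort(mean_abs)[::-1]]
print(ranking)
```

The additivity property gives a quick sanity check: `base_value + phi.sum(axis=1)` should reproduce `model.decision_function(X)` for every row.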

Highlights

Cost-sensitive threshold analysis: LR vs RF
LR over RF: for deployment robustness, not AUC
Fig 10. LR reaches minimum expected cost at t = 0.48 with a flat, smooth curve. RF reaches its minimum at t = 0.16 with a steep profile, which makes it highly sensitive to threshold variation. For a healthcare screening system where threshold drift is operationally inevitable, LR's stable behaviour is the decisive factor.
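
The threshold comparison in Fig 10 can be sketched as a sweep that scores each candidate threshold by expected misclassification cost, then measures how much the cost rises if the threshold drifts. This is a minimal illustration, not the project's code: the 5:1 cost ratio, the ±0.05 drift window, and the synthetic scores are all assumptions.

```python
import numpy as np

def expected_cost_curve(y_true, y_prob, thresholds, cost_fn=5.0, cost_fp=1.0):
    """Mean misclassification cost per respondent at each threshold.

    cost_fn and cost_fp are hypothetical values chosen for illustration.
    """
    y_true = np.asarray(y_true).astype(bool)
    y_prob = np.asarray(y_prob)
    costs = []
    for t in thresholds:
        y_pred = y_prob >= t
        fn = np.sum(y_true & ~y_pred)  # at-risk respondents missed
        fp = np.sum(~y_true & y_pred)  # not-at-risk respondents flagged
        costs.append((cost_fn * fn + cost_fp * fp) / y_true.size)
    return np.array(costs)

# Synthetic stand-in for model scores on an imbalanced label (assumption).
rng = np.random.default_rng(0)
y_true = rng.random(5000) < 0.15
y_prob = np.clip(0.25 + 0.3 * y_true + rng.normal(0, 0.15, 5000), 0, 1)

thresholds = np.round(np.arange(0.05, 0.96, 0.01), 2)
costs = expected_cost_curve(y_true, y_prob, thresholds)
best_t = thresholds[np.argmin(costs)]

# Flatness: worst-case cost increase if the threshold drifts by up to 0.05.
near = np.abs(thresholds - best_t) <= 0.05 + 1e-9
drift_penalty = costs[near].max() - costs.min()
print(f"best t = {best_t:.2f}, cost = {costs.min():.3f}, drift penalty = {drift_penalty:.3f}")
```

A model with a small `drift_penalty` has the flat profile the caption attributes to LR; a large value corresponds to the steep profile attributed to RF.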
Calibration curves — LR vs RF with isotonic regression
Calibration required before deployment
Fig 8. LR exhibits conservative probability estimates at higher risk levels. Uncalibrated RF shows substantial miscalibration; after isotonic regression, RF aligns closely with the ideal diagonal. Explicit post-hoc calibration is required before model outputs are used for individual-level risk communication.
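
One common way to apply isotonic calibration in scikit-learn is `CalibratedClassifierCV` with `method="isotonic"`. The source does not say which API the project used, so the sketch below is an assumption, run on synthetic data with illustrative hyperparameters.

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the survey features (assumption).
X, y = make_classification(n_samples=4000, n_features=21, weights=[0.85, 0.15],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Uncalibrated random forest.
rf = RandomForestClassifier(n_estimators=50, class_weight="balanced", random_state=0)
rf.fit(X_train, y_train)
p_raw = rf.predict_proba(X_test)[:, 1]

# Same model wrapped in cross-validated isotonic calibration.
calibrated = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=50, class_weight="balanced", random_state=0),
    method="isotonic",
    cv=3,
)
calibrated.fit(X_train, y_train)
p_cal = calibrated.predict_proba(X_test)[:, 1]

# Brier score (lower is better) and the points of a reliability curve.
brier_raw = brier_score_loss(y_test, p_raw)
brier_cal = brier_score_loss(y_test, p_cal)
prob_true, prob_pred = calibration_curve(y_test, p_cal, n_bins=10)
print(f"Brier raw = {brier_raw:.4f}, calibrated = {brier_cal:.4f}")
```

Plotting `prob_pred` against `prob_true` gives the kind of reliability curve shown in Fig 8; points near the diagonal indicate well-calibrated probabilities.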
Diabetes distribution across clusters
External validation: clusters align with diabetes prevalence
Fig 17. Diabetes distribution across the three K-means clusters. Cluster 2 carries the highest diabetes rate, Cluster 0 the lowest. A chi-square test confirms that the association is statistically significant, so the unsupervised partition captures clinically meaningful structure without access to the diabetes label during training.
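
The external validation step can be sketched as a chi-square test of independence on the cluster-by-label contingency table. The cluster assignments and per-cluster prevalence rates below are synthetic assumptions, not the project's results. Cramér's V is included as an effect size because significance alone is easy to reach at large sample sizes; the source does not report it.

```python
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)

# Synthetic cluster assignments and hypothetical per-cluster diabetes rates.
n = 3000
clusters = rng.integers(0, 3, size=n)
rates = np.array([0.05, 0.15, 0.35])
diabetes = rng.random(n) < rates[clusters]

# 3 x 2 contingency table: rows are clusters, columns are label 0 / label 1.
table = np.zeros((3, 2), dtype=int)
np.add.at(table, (clusters, diabetes.astype(int)), 1)

chi2, p_value, dof, expected = chi2_contingency(table)

# Cramér's V: effect size in [0, 1] for a contingency table.
cramers_v = np.sqrt(chi2 / (table.sum() * (min(table.shape) - 1)))
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p_value:.3g}, V = {cramers_v:.3f}")
```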

Classification Results

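| Model | Variant | ROC-AUC | AP | At-risk Recall | F1 |
|---|---|---|---|---|---|
| LR | Baseline | 0.818 | 0.433 | | |
| LR | Tuned · C = 0.05 · t = 0.48 | 0.818 | 0.433 | 0.761 | 0.470 |
| RF | Baseline | 0.818 | 0.453 | | |
| RF | Tuned · t = 0.16 | 0.821 | 0.456 | 0.746 | 0.456 |
| LR | + SMOTE | 0.815 | | 0.753 | 0.468 |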

CDC BRFSS 2015 · 253,680 respondents · Binary at-risk label (Diabetes_012 ≥ 1) · Class-weighted training · RFECV retains all 21 features. LR selected on cost-curve robustness, not AUC margin.
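
The setup summarised above (binarised label, class-weighted training, C = 0.05, decision threshold 0.48) can be sketched as follows. Only those settings come from the text. The data are synthetic, the scaling step and train/test split are assumptions, and RFECV is omitted since it is reported to retain all 21 features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, f1_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: 21 features and a three-level label in the style of
# Diabetes_012 (0, 1, 2). The generating process is hypothetical.
n = 4000
X = rng.normal(size=(n, 21))
latent = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)
diabetes_012 = np.digitize(latent, [1.5, 1.8])

# Binary at-risk label: Diabetes_012 >= 1.
y = (diabetes_012 >= 1).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Class-weighted logistic regression with the regularisation strength from the table.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(C=0.05, class_weight="balanced", max_iter=1000),
)
model.fit(X_train, y_train)

# Apply the tuned operating threshold instead of the default 0.5.
threshold = 0.48
y_prob = model.predict_proba(X_test)[:, 1]
y_pred = (y_prob >= threshold).astype(int)

metrics = {
    "roc_auc": roc_auc_score(y_test, y_prob),
    "ap": average_precision_score(y_test, y_prob),
    "at_risk_recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
}
print({name: round(value, 3) for name, value in metrics.items()})
```

ROC-AUC and AP are computed from the probabilities and do not depend on the threshold, whereas recall and F1 are properties of the chosen operating point. That is why the table reports the latter two only for variants with a fixed threshold.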

Full model comparison — ROC and PR curves for LR and RF
Fig 7. ROC and PR curves for tuned LR and RF. The two models achieve near-identical ROC-AUC (0.818 vs 0.821). At each model's tuned operating threshold (t = 0.48 for LR, t = 0.16 for RF), LR attains the higher at-risk recall (0.761 vs 0.746), which supports LR as the preferred screening model.

Clustering Results

| Cluster | Label | BMI | GenHlth | HighBP / Chol / CVD | Diabetes Rate | Intervention |
|---|---|---|---|---|---|---|
| 0 | Low Risk | Low | Good | Low | Lowest | Preventive |
| 1 | Lifestyle-driven | Moderate | Moderate | Low–moderate | Moderate | Lifestyle change |
| 2 | High Clinical Burden | High | Poor | High | Highest | Clinical management |

K-means (k = 3) on 14 health and lifestyle features · PCA(6) · Silhouette = 0.197 · External validity confirmed by chi-square against diabetes labels. k selected for stability across PCA representations, not peak silhouette.
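
Selecting k for stability across PCA representations can be sketched by computing the silhouette score for several values of k under both PCA(6) and PCA(10), then comparing the two profiles. The synthetic 14-feature data, the candidate range of k, and the helper name `silhouette_by_k` are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

def silhouette_by_k(X, n_components, ks=(2, 3, 4, 5), seed=0):
    """Silhouette score of K-means for each k on a PCA-reduced, standardised X."""
    X_std = StandardScaler().fit_transform(X)
    Z = PCA(n_components=n_components, random_state=seed).fit_transform(X_std)
    scores = {}
    for k in ks:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(Z)
        scores[k] = float(silhouette_score(Z, labels))
    return scores

# Synthetic data: three overlapping groups in 14 dimensions (assumption).
rng = np.random.default_rng(0)
centres = rng.normal(0, 1.5, size=(3, 14))
X = np.vstack([centre + rng.normal(size=(500, 14)) for centre in centres])

scores_pca6 = silhouette_by_k(X, n_components=6)
scores_pca10 = silhouette_by_k(X, n_components=10)

# A k whose score changes little between representations is the stable choice.
for k in scores_pca6:
    gap = abs(scores_pca6[k] - scores_pca10[k])
    print(f"k={k}: PCA(6)={scores_pca6[k]:.3f}  PCA(10)={scores_pca10[k]:.3f}  gap={gap:.3f}")
```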

Silhouette comparison — K-means vs Hierarchical clustering
Fig 15. Hierarchical clustering achieves a higher mean silhouette score (0.297 vs 0.197) but concentrates the majority of samples into a single dominant cluster. K-means produces a more balanced, interpretable partition — the basis for selecting K-means over hierarchical.
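
The trade-off in Fig 15 can be checked by reporting cluster balance alongside the silhouette score. The sketch below compares K-means with agglomerative clustering on synthetic data. The Ward linkage, the data, and the `summarise` helper are assumptions; the source does not state which linkage was used.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import silhouette_score

def summarise(X, labels):
    """Silhouette score plus the share of samples in the largest cluster."""
    sizes = np.bincount(labels)
    return {
        "silhouette": float(silhouette_score(X, labels)),
        "largest_share": float(sizes.max() / sizes.sum()),
    }

# Synthetic data: one large group and two small ones (assumption).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 1, size=(700, 6)),
    rng.normal(4, 1, size=(150, 6)),
    rng.normal(-4, 1, size=(150, 6)),
])

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
hier_labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)

kmeans_summary = summarise(X, kmeans_labels)
hier_summary = summarise(X, hier_labels)
print("K-means:     ", kmeans_summary)
print("Hierarchical:", hier_summary)
```

A high silhouette combined with a `largest_share` close to 1 signals the situation described in the caption: well-separated but dominated by one cluster, which limits its use for segmentation.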

Read More

Full analysis covering RFECV feature selection, cost-sensitive threshold experiments, SMOTE vs class-weighting comparison, calibration curves, SHAP deep-dive, and hierarchical vs K-means clustering evaluation — with all 17 figures:

Read the analysis note →