Machine Learning for Diabetes Risk Prediction

Classification and population segmentation on the CDC BRFSS 2015 survey (253,680 respondents) — individual risk screening via Logistic Regression, and unsupervised health-profile discovery via K-means clustering.

Logistic Regression · Random Forest · K-means · RFECV · SMOTE · Class Weighting · Threshold Optimisation · Isotonic Calibration · PCA · SHAP · scikit-learn · pandas · matplotlib

Fig 16. Cluster health profiles (standardised means heatmap). Cluster 2 (High Clinical Burden) shows consistently elevated risk across every dimension and the highest diabetes prevalence. The three clusters map directly onto actionable intervention tiers: preventive, lifestyle-change, and clinical management.

Abstract

The CDC BRFSS 2015 survey records 253,680 respondents across 21 health, lifestyle, and demographic features. The diabetes label is heavily imbalanced, which motivates cost-sensitive evaluation and threshold optimisation throughout. This project pursues two objectives under the CRISP-DM framework: individual-level risk prediction through supervised classification, and population-level segmentation through unsupervised clustering.

The final Logistic Regression model achieves ROC-AUC 0.818 and at-risk recall 0.761 at an operating threshold of t = 0.48 — selected not for raw accuracy, but for the flatness of its cost curve, which makes deployment behaviour robust to threshold variation. SHAP confirms GenHlth, Age, and BMI as the three dominant drivers. K-means (k = 3) recovers three clinically differentiated population strata, validated externally by chi-square against diabetes labels.

Key Findings

ROC-AUC 0.818: final LR, consistent between cross-validation and the test set (CV: 0.8177, test: 0.8176)

At-risk recall 0.761 at t = 0.48: fewer false negatives than RF at equivalent precision

k = 3: silhouette score stable across PCA(6) and PCA(10); selected for robustness, not peak score

Top 3 SHAP drivers: GenHlth, Age, BMI, consistent with clinical literature on T2D risk factors

Highlights

**LR over RF for deployment robustness, not AUC.** Fig 10. Cost-sensitive threshold analysis, LR vs RF. LR reaches minimum expected cost at t = 0.48 with a flat, smooth curve. RF reaches its minimum at t = 0.16 on a steep curve, so its cost is highly sensitive to threshold variation. For a healthcare screening system where threshold drift is operationally inevitable, LR's stable behaviour is the decisive factor.
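The threshold comparison can be illustrated with a minimal sketch. The cost ratio used in the project is not stated here, so the 5:1 false-negative to false-positive ratio, the function names `expected_cost_curve` and `flatness`, and the synthetic scores are all assumptions made for illustration only.

```python
import numpy as np

def expected_cost_curve(y_true, proba, thresholds, cost_fn=5.0, cost_fp=1.0):
    """Mean misclassification cost per respondent at each threshold.

    cost_fn and cost_fp are hypothetical; the 5:1 ratio is an assumption.
    """
    y_true = np.asarray(y_true).astype(bool)
    proba = np.asarray(proba, dtype=float)
    costs = []
    for t in thresholds:
        pred = proba >= t
        fn = np.sum(y_true & ~pred)   # at-risk respondents missed
        fp = np.sum(~y_true & pred)   # not-at-risk respondents flagged
        costs.append((cost_fn * fn + cost_fp * fp) / y_true.size)
    return np.array(costs)

def flatness(costs, thresholds, t_star, width=0.05):
    """Worst-case cost increase within +/- width of t_star (smaller = flatter)."""
    thresholds = np.asarray(thresholds)
    window = np.abs(thresholds - t_star) <= width + 1e-12
    at_star = costs[np.argmin(np.abs(thresholds - t_star))]
    return costs[window].max() - at_star

# Synthetic scores standing in for model probabilities (not BRFSS data).
rng = np.random.default_rng(0)
y = rng.random(5000) < 0.15
scores = np.clip(0.25 + 0.3 * y + rng.normal(0, 0.15, y.size), 0, 1)

thresholds = np.linspace(0.05, 0.95, 91)
costs = expected_cost_curve(y, scores, thresholds)
t_best = thresholds[np.argmin(costs)]
drift_penalty = flatness(costs, thresholds, t_best)
```

Computing `flatness` for two models at their respective optimal thresholds gives one number per model for comparing sensitivity to threshold drift, which is the property the figure shows visually.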

**Calibration required before deployment.** Fig 8. Calibration curves for LR and RF, with isotonic regression. LR exhibits conservative probability estimates at higher risk levels. Uncalibrated RF shows substantial miscalibration; after isotonic regression, RF aligns closely with the ideal diagonal. Explicit post-hoc calibration is required before model outputs are used for individual-level risk communication.
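A sketch of post-hoc isotonic calibration with scikit-learn follows. It uses synthetic data from `make_classification` rather than BRFSS, and the hyperparameters (`n_estimators=50`, `cv=3`) are illustrative assumptions, not the project's settings.

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the survey (assumption).
X, y = make_classification(n_samples=4000, n_features=21,
                           weights=[0.85, 0.15], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=50, class_weight="balanced",
                            random_state=0)

# Fit the forest inside cross-validation and learn an isotonic map
# from raw scores to probabilities on the held-out folds.
calibrated = CalibratedClassifierCV(rf, method="isotonic", cv=3)
calibrated.fit(X_train, y_train)

proba = calibrated.predict_proba(X_test)[:, 1]

# Points of the reliability curve: observed positive rate per bin
# versus mean predicted probability per bin.
frac_pos, mean_pred = calibration_curve(y_test, proba, n_bins=10)
```

Plotting `mean_pred` against `frac_pos` and comparing with the diagonal gives a reliability curve of the kind the figure describes.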

**External validation: clusters align with diabetes prevalence.** Fig 17. Diabetes distribution across the three K-means clusters. Cluster 2 carries the highest diabetes rate, Cluster 0 the lowest. A chi-square test confirms the association is statistically significant: the unsupervised partition captures clinically meaningful structure without access to the diabetes label during training.
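The external validation step can be sketched with `scipy.stats.chi2_contingency`. The cluster assignments and diabetes rates below are synthetic and hypothetical. Cramér's V is added here as an effect-size companion that this summary does not report, since with very large samples a chi-square test tends to reach significance even for weak associations.

```python
import numpy as np
from scipy.stats import chi2_contingency

def cluster_label_association(cluster_ids, labels):
    """Chi-square test of independence between cluster and class label."""
    cluster_ids = np.asarray(cluster_ids)
    labels = np.asarray(labels)
    clusters = np.unique(cluster_ids)
    classes = np.unique(labels)
    # Contingency table: rows are clusters, columns are label values.
    table = np.array([[np.sum((cluster_ids == c) & (labels == k))
                       for k in classes] for c in clusters])
    chi2, p_value, dof, _ = chi2_contingency(table)
    # Cramer's V: effect size in [0, 1], less dependent on sample size.
    cramers_v = np.sqrt(chi2 / (table.sum() * (min(table.shape) - 1)))
    return table, chi2, p_value, dof, cramers_v

# Hypothetical per-cluster at-risk rates, for illustration only.
rng = np.random.default_rng(0)
cluster_ids = rng.integers(0, 3, size=6000)
rates = np.array([0.05, 0.15, 0.35])
labels = (rng.random(6000) < rates[cluster_ids]).astype(int)

table, chi2, p_value, dof, cramers_v = cluster_label_association(
    cluster_ids, labels)
```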

Classification Results

Model	Variant	ROC-AUC	AP	At-risk Recall	F1
LR	Baseline	0.818	0.433	—	—
LR	Tuned · C = 0.05 · t = 0.48	0.818	0.433	0.761	0.470
RF	Baseline	0.818	0.453	—	—
RF	Tuned · t = 0.16	0.821	0.456	0.746	0.456
LR	+ SMOTE	0.815	—	0.753	0.468

CDC BRFSS 2015 · 253,680 respondents · Binary at-risk label (Diabetes_012 ≥ 1) · Class-weighted training · RFECV retains all 21 features. LR selected on cost-curve robustness, not AUC margin.
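The training setup described above might look roughly like the following sketch. It runs on synthetic data, the fold count and `max_iter` are assumptions, and the number of features retained on synthetic data will generally differ from the 21 reported for BRFSS.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the 21 survey features (assumption).
# On the real data the binary target would be derived as
# y = (df["Diabetes_012"] >= 1).astype(int), and features would
# normally be standardised before fitting a regularised LR.
X, y = make_classification(n_samples=3000, n_features=21, n_informative=8,
                           weights=[0.85, 0.15], random_state=0)

# class_weight="balanced" reweights the loss inversely to class frequency.
lr = LogisticRegression(C=0.05, class_weight="balanced", max_iter=1000)

# Recursive feature elimination with cross-validated ROC-AUC.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
selector = RFECV(lr, step=1, cv=cv, scoring="roc_auc")
selector.fit(X, y)

n_kept = selector.n_features_   # number of features retained
kept_mask = selector.support_   # boolean mask over the 21 columns
```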

Fig 7. ROC and PR curves for tuned LR and RF. Both models achieve near-identical ROC-AUC (0.818 vs 0.821); the PR curves reveal LR's higher at-risk recall (0.761 vs 0.746) at the operating threshold, confirming LR as the preferred screening model.

Clustering Results

Cluster	Label	BMI	GenHlth	HighBP / Chol / CVD	Diabetes Rate	Intervention
0	Low Risk	Low	Good	Low	Lowest	Preventive
1	Lifestyle-driven	Moderate	Moderate	Low–moderate	Moderate	Lifestyle change
2	High Clinical Burden	High	Poor	High	Highest	Clinical management

K-means (k = 3) on 14 health and lifestyle features · PCA(6) · Silhouette = 0.197 · External validity confirmed by chi-square against diabetes labels. k selected for stability across PCA representations, not peak silhouette.
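One way to check that a choice of k is stable across PCA representations is sketched below. It uses synthetic blobs in place of the 14 survey features, and the helper name `silhouette_by_k` and the candidate range of k are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 14 health and lifestyle features (assumption).
X, _ = make_blobs(n_samples=1500, n_features=14, centers=3,
                  cluster_std=4.0, random_state=0)
X_std = StandardScaler().fit_transform(X)

def silhouette_by_k(X_std, n_components, ks=(2, 3, 4, 5), seed=0):
    """Silhouette score of K-means for each k in a PCA-reduced space."""
    Z = PCA(n_components=n_components, random_state=seed).fit_transform(X_std)
    scores = {}
    for k in ks:
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=seed).fit_predict(Z)
        # For a dataset the size of BRFSS, silhouette_score's
        # sample_size argument keeps this step tractable.
        scores[k] = silhouette_score(Z, labels)
    return scores

scores_pca6 = silhouette_by_k(X_std, n_components=6)
scores_pca10 = silhouette_by_k(X_std, n_components=10)
```

Comparing the two dictionaries shows whether a given k scores similarly under both representations, which is the stability criterion described above.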

Fig 15. Silhouette comparison, K-means vs hierarchical clustering. Hierarchical clustering achieves a higher mean silhouette score (0.297 vs 0.197) but concentrates the majority of samples into a single dominant cluster. K-means produces a more balanced, interpretable partition, which is the basis for selecting K-means over hierarchical clustering.
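The trade-off between silhouette score and cluster balance can be made explicit by reporting both for each method, as in this sketch. The data is synthetic, and Ward linkage is an assumption: the linkage used in the project is not stated here.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data for illustration only (not BRFSS).
X, _ = make_blobs(n_samples=1000, n_features=6, centers=3,
                  cluster_std=3.0, random_state=0)

def summarise(X, labels):
    """Silhouette score plus the share of samples in the largest cluster."""
    sizes = np.bincount(labels)
    return {
        "silhouette": silhouette_score(X, labels),
        "largest_share": sizes.max() / sizes.sum(),
    }

km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
hc_labels = AgglomerativeClustering(n_clusters=3,
                                    linkage="ward").fit_predict(X)

km_summary = summarise(X, km_labels)
hc_summary = summarise(X, hc_labels)
```

A `largest_share` close to 1 flags a partition dominated by a single cluster, even when its silhouette score is higher.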

Full analysis covering RFECV feature selection, cost-sensitive threshold experiments, SMOTE vs class-weighting comparison, calibration curves, SHAP deep-dive, and hierarchical vs K-means clustering evaluation — with all 17 figures: