The CDC BRFSS 2015 survey records 253,680 respondents with 21 features covering health status,
lifestyle, and demographics. The target variable Diabetes_012 is three-category (0: no diabetes,
1: prediabetes, 2: diabetes), transformed into a binary at-risk label (At-risk = Diabetes_012 ≥ 1)
for classification. The diabetes outcome is notably imbalanced — most individuals are classified as
no-diabetes — motivating class weighting and threshold-based evaluation.
This project addresses two objectives: individual-level risk prediction through classification, and population-level segmentation through clustering. The workflow follows CRISP-DM.

Dataset and preparation
No missing values. 9.42% of records share identical responses — retained to preserve population distribution, as no unique respondent identifier exists. All features are already numerically encoded (binary or ordinal). Standardisation applied selectively: only for logistic regression; tree-based models trained on unscaled features.
For clustering, only the 14 health status and lifestyle features were retained — they more directly reflect population health profiles than demographics. Correlation and variance-based checks found no highly correlated or near-zero variance features among these 14.
Classification models
Feature selection via RFECV (LR estimator, stratified CV, ROC-AUC scoring) found that cross-validated AUC plateaued at 0.818 with all 21 features retained — no benefit from removal. All features were kept for modelling.

Experiments included: (1) baseline vs tuned LR (GridSearchCV over C), (2) baseline vs tuned RF (RandomizedSearchCV), (3) tuned model comparison at t = 0.5, (4) class-weighted vs SMOTE, and (5) cost-sensitive threshold optimisation for healthcare deployment.


| Model | ROC-AUC | AP | At-risk Recall | F1 |
|---|---|---|---|---|
| LR baseline | 0.818 | 0.433 | — | — |
| LR tuned (C=0.05) | 0.818 | 0.433 | 0.761 | 0.470 |
| RF baseline | 0.818 | 0.453 | — | — |
| RF tuned | 0.821 | 0.456 | 0.746 | 0.456 |
| LR + SMOTE | 0.815 | — | 0.753 | 0.468 |
CV ROC-AUC for final LR: 0.8177 (test: 0.8176) — stable generalisation. Tuning yields no meaningful improvement for either model.
LR vs RF

Under t = 0.5, LR achieves higher at-risk recall (0.761 vs 0.746) — fewer false negatives, more critical for the screening objective. The real distinction comes from the cost curve. LR reaches minimum expected cost at t = 0.48 with a flat, smooth curve — robust to threshold variation. RF requires t = 0.16 and its cost curve is steep, highly sensitive to small threshold changes.

LR selected as the final model. Not for AUC, but for stable and interpretable operating behaviour suited to screening where missed at-risk cases are costly.
Class weighting vs SMOTE

SMOTE-based LR closely matches class-weighted LR (ROC-AUC 0.815 vs 0.818, recall 0.753 vs 0.761, F1 0.468 vs 0.470). No meaningful gain — class weighting retained given lower complexity.
Threshold sensitivity

Comparison and calibration

Neither model is well-calibrated across the full probability range. LR exhibits conservative estimates at higher predicted risk levels. Uncalibrated RF shows substantial miscalibration; after isotonic regression, RF probabilities align more closely with the ideal calibration diagonal. Explicit calibration is required before outputs are used for individual-level risk communication.

Final model: LR tuned at t = 0.48. ROC-AUC 0.8176, at-risk recall 0.7606, accuracy 0.7303, 1,914 false negatives.
SHAP interpretability
SHAP confirms predictions are driven by a small set of health-related factors with clear directional effects.
| Feature | Mean SHAP | Direction |
|---|---|---|
| GenHlth | 0.50 | Poorer self-reported health → ↑ risk |
| Age | 0.38 | Older age → ↑ risk |
| BMI | 0.35 | Higher BMI → ↑ risk |
| HighBP | 0.34 | Hypertension present → ↑ risk |
| HighChol | 0.29 | High cholesterol → ↑ risk |

At the individual level, high-risk classifications arise from the cumulative contribution of dominant factors rather than any single extreme value. Protective attributes — absence of chronic conditions — partially offset risk but rarely dominate.
Clustering models
k selection
k=2 achieves a high silhouette score but represents only a coarse binary separation. k=6 yields slightly higher scores under PCA(6) but is inconsistent under PCA(10) — suggesting over-segmentation. k=3 is stable across both representations and was selected, balancing quality, robustness, and interpretability.


PCA(6) vs PCA(10), K-means vs Hierarchical
| Method | Silhouette | WCSS | Note |
|---|---|---|---|
| K-means PCA(6) k=3 | 0.197 | 1,440,681 | Smooth, balanced — selected |
| K-means PCA(10) k=3 | similar | higher | Cluster 2 compressed, less balanced |
| Hierarchical PCA(6) k=3 | 0.297 | higher | One dominant cluster, size imbalance |
PCA(6) shows smoother transitions between clusters along PC1, which better reflects gradual health variation. Hierarchical clustering achieves a higher silhouette score but concentrates most samples into a single dominant cluster — K-means produces more evenly sized, interpretable partitions.



Cluster health profiles
The three clusters exhibit clearly differentiated health patterns and distinct diabetes risk compositions (chi-square confirms external validity).
| Cluster 0 | Cluster 1 | Cluster 2 | |
|---|---|---|---|
| Label | Low Risk | Lifestyle-driven | High Clinical Burden |
| BMI | Low | Moderate | High |
| GenHlth / PhysHlth | Good | Moderate | Poor |
| HighBP / HighChol / CVD | Low | Low–moderate | High |
| Physical activity | Moderate | Low | Low |
| Smoking / Alcohol | Low | High | Moderate |
| Diabetes composition | Lowest | Moderate | Highest |
| Intervention | Preventive | Lifestyle change | Clinical management |


Cluster 0 shows generally favourable health indicators — lower BMI, fewer chronic conditions, and better self-reported health — representing a low-risk profile suited to preventive health maintenance. Cluster 1 is characterised primarily by lifestyle-related differences, particularly in physical activity, smoking, and dietary behaviours, despite relatively moderate clinical risk, indicating a need for behaviour-focused lifestyle interventions. Cluster 2 displays consistently elevated levels across multiple adverse health indicators, including BMI, poor general health, cardiovascular conditions, and mobility limitations — a high-risk profile driven by accumulated clinical burden that warrants targeted clinical management and monitoring.
The clear separation of health profiles across all three clusters supports the external validity of the clustering solution and its potential application in population-level diabetes risk stratification.
Reflection
Why LR over RF? Not AUC — both are comparable. The decision rests on the cost curve shape. LR’s flat minimum around t ≈ 0.48 means deployment threshold can be varied without sharply increasing cost. RF’s steep minimum at t ≈ 0.16 is fragile.
Why k=3? Not the highest silhouette — but the most stable across PCA representations. As Hennig (2007) notes: a meaningful cluster should not disappear easily when the data is changed in a non-essential way.
Limitations: BRFSS is cross-sectional — causality cannot be inferred. Both models lack reliable probability calibration across the full risk range. Clustering is sensitive to feature selection scope and PCA dimensionality choice.