Deep Learning for Credit Card Fraud Detection

Comparing supervised, unsupervised, and hybrid models on the ULB dataset — with drift analysis and SHAP interpretability.

Logistic Regression · Feed-Forward Network · Isolation Forest · Autoencoder · Ensemble · Chronological Split · Concept Drift · Class Imbalance · Bootstrap CI · SHAP · PyTorch · scikit-learn · pandas · matplotlib · MLOps Simulation · PSI Drift Monitoring · Canary Rollout · Federated Learning

PR curves for all six models with 95% bootstrap confidence intervals. ENS achieves the highest PR-AUC (0.772); unsupervised models fall near the no-skill baseline.
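The confidence intervals in the figure can be approximated with a percentile bootstrap over test rows. The sketch below is an illustration only: the page does not say which bootstrap variant was used, so the percentile method, the function names, and the default of 1,000 resamples are all assumptions. PR-AUC is computed here as average precision, and tied scores are ranked in input order, which can differ slightly from library implementations.

```python
import random


def average_precision(y_true, scores):
    """Mean of the precision values observed at each positive, ranked by descending score."""
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank
    return total / n_pos


def bootstrap_ci(y_true, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for average precision (assumed method)."""
    if sum(y_true) == 0:
        raise ValueError("need at least one positive label")
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y_true[i] for i in idx]
        if sum(ys) == 0:
            continue  # a resample with no fraud has no defined PR-AUC; draw again
        stats.append(average_precision(ys, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi
```

With only a few dozen fraud cases in a test set, intervals from this procedure tend to be wide, which is one reason small PR-AUC differences between models may not be statistically meaningful.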

Abstract

Payment card fraud across the European Economic Area reached EUR 4.2 billion in 2024. This project evaluates six model configurations on the ULB dataset — logistic regression, feed-forward network, isolation forest, autoencoder, and two hybrid extensions — under a strict chronological split that simulates deployment conditions. PR-AUC is adopted as the primary metric given extreme class imbalance (1:578). The FFN model (PR-AUC 0.766) remains statistically indistinguishable from logistic regression (McNemar's p = 0.58), while the LR+FFN ensemble pushes PR-AUC further to 0.772 — suggesting deep learning's advantages are operational rather than purely predictive. The project also simulates a production MLOps lifecycle — a versioned model registry with CI/CD promotion gates, PSI-based drift monitoring, and a champion/challenger canary rollout with statistical rollback — and benchmarks federated (FedAvg) against centralized and incremental retraining strategies.
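McNemar's test compares two classifiers on the same test rows by looking only at the cases where exactly one of them is correct. The following is a minimal sketch of the exact (binomial) two-sided variant; the abstract does not state which variant produced p = 0.58, so this choice, along with the function name, is an assumption. Predictions are taken to be already binarized at each model's decision threshold.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar p-value from the discordant pairs of two classifiers."""
    # b: model A correct, model B wrong; c: model A wrong, model B correct
    b = sum(1 for y, a, m in zip(y_true, pred_a, pred_b) if a == y and m != y)
    c = sum(1 for y, a, m in zip(y_true, pred_a, pred_b) if a != y and m == y)
    n = b + c
    if n == 0:
        return 1.0  # the models never disagree on correctness
    k = min(b, c)
    # Under H0 the discordant pairs split 50/50, so b ~ Binomial(n, 0.5)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

A large p-value means the disagreements between the two models are split roughly evenly, so neither can be said to make fewer errors on this test set.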

Live Demo

Interactive fraud detection system — submit a transaction to see the risk score, model decision, and SHAP feature breakdown in real time.

Highlights

SHAP feature importance — SHAP importance tracks per-feature Kolmogorov-Smirnov (KS) separability (ρ = +0.727), which suggests the FFN learned genuinely discriminative features.
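A correlation of this kind can be computed by ranking features on both measures and correlating the ranks. The sketch below assumes ρ is a Spearman rank correlation, which the page does not state; the inputs would be one mean |SHAP| value and one KS statistic per feature, and the function names are hypothetical.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the rank-transformed inputs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5
```

A rank-based measure is a reasonable fit here because SHAP magnitudes and KS statistics live on different scales, and only the ordering of features is being compared.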

Error analysis — 23% of fraud cases are missed by all models simultaneously, a data-level floor that no architecture alone can resolve.

Drift mitigation — Only score-level ensembling improved performance under drift; sliding-window and time-weighted training made it worse because each reduces the already scarce fraud examples available for fitting.
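Score-level ensembling combines the output probabilities of already-trained models rather than retraining anything. The sketch below shows one simple form, a weighted average of two score lists; the equal default weight and the function name are assumptions, since the page does not describe how the LR and FFN scores were combined.

```python
def ensemble_scores(scores_lr, scores_ffn, weight_lr=0.5):
    """Weighted average of two models' fraud probabilities, row by row."""
    if len(scores_lr) != len(scores_ffn):
        raise ValueError("score lists must have the same length")
    if not 0.0 <= weight_lr <= 1.0:
        raise ValueError("weight_lr must lie in [0, 1]")
    return [
        weight_lr * a + (1.0 - weight_lr) * b
        for a, b in zip(scores_lr, scores_ffn)
    ]
```

Because this approach leaves each model's training data untouched, it avoids the data loss that windowing or time-weighting introduces when fraud cases are rare.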

Incremental and federated learning comparison — Incremental retraining across five chronological blocks tracks full retraining within ~0.02 PR-AUC; federated FedAvg training reaches PR-AUC 0.750 versus 0.766 centralized, a 2.1% relative gap that points to a viable privacy-preserving deployment path.
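The core of FedAvg is the server-side aggregation step: each client's parameters are averaged, weighted by how many training rows that client holds. The sketch below shows only that step, with parameters represented as plain dictionaries of float lists; this representation, the function name, and the client sizes in any example are hypothetical, and local training on each client is omitted.

```python
def fedavg(client_weights, client_sizes):
    """Size-weighted average of per-client parameter dictionaries."""
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one size per client and at least one client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    aggregated = {}
    for name in client_weights[0]:
        dim = len(client_weights[0][name])
        aggregated[name] = [
            sum(w[name][d] * n for w, n in zip(client_weights, client_sizes)) / total
            for d in range(dim)
        ]
    return aggregated


# Relative PR-AUC gap between centralized and federated training, from the reported figures
relative_gap = (0.766 - 0.750) / 0.766
```

In a full training loop, the aggregated parameters would be sent back to every client at the start of the next round, so raw transactions never leave the client.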

MLOps drift monitoring dashboard — A simulated MLOps lifecycle: a CI/CD registry gates promotion by PR-AUC, PSI-based monitoring tracks drift across five windows (0.167 → 0.183, against a 0.25 alert threshold), and a champion/challenger canary rollout — validated with McNemar's test (p = 1.0) — backs an automatic rollback rule.
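The Population Stability Index compares the binned distribution of a feature or score in a reference window against a later window, using PSI = Σ (a_i − e_i) · ln(a_i / e_i) over bins. The sketch below is one common way to compute it; the choice of ten bins at reference quantiles, the epsilon floor for empty bins, and the function name are assumptions, while the 0.25 alert threshold is the one quoted above.

```python
import bisect
import math


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index of `actual` against the reference sample `expected`."""
    ref = sorted(expected)
    # Interior bin edges placed at quantiles of the reference sample
    edges = [ref[int(len(ref) * q / n_bins)] for q in range(1, n_bins)]

    def fractions(values):
        counts = [0] * n_bins
        for v in values:
            counts[bisect.bisect_left(edges, v)] += 1
        # Floor empty bins at eps so the logarithm stays finite
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))


ALERT_THRESHOLD = 0.25  # alert level quoted in the text


def drift_alert(expected, actual):
    """True when PSI exceeds the alert threshold."""
    return psi(expected, actual) > ALERT_THRESHOLD
```

Every term in the sum is non-negative, so PSI is zero only when the binned distributions match and grows as they diverge; the reported values of 0.167 to 0.183 would not trigger `drift_alert` under this threshold.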

Results

Model	Paradigm	PR-AUC	F1	Recall@P=0.9
LR	Supervised ML	0.692	0.753	0.673
FFN	Supervised DL	0.766	0.813	0.731
IF	Unsupervised ML	0.054	0.133	0.000
AE	Unsupervised DL	0.069	0.129	0.000
W1B	Semi-supervised	0.770	0.821	0.750
ENS	Ensemble	0.772	0.821	0.750

Test set: n = 42,722 transactions, including 52 fraud cases. The decision threshold targets precision ≥ 0.90.
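The Recall@P=0.9 column can be read as "the best recall achievable at any threshold whose precision is at least 0.90". The sketch below implements that reading; it is an assumption about how the column was computed, and the function name is hypothetical. It returns a recall of 0.0 when no threshold reaches the target precision, which matches the pattern of the IF and AE rows.

```python
def recall_at_precision(y_true, scores, min_precision=0.90):
    """Return (best recall, threshold) among thresholds with precision >= min_precision."""
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    best_recall, best_threshold = 0.0, None
    tp = 0
    for rank, i in enumerate(order, start=1):
        tp += y_true[i]
        # Evaluate only after the last row of a group of tied scores
        if rank < len(order) and scores[order[rank]] == scores[i]:
            continue
        precision = tp / rank
        recall = tp / n_pos
        if precision >= min_precision and recall > best_recall:
            best_recall, best_threshold = recall, scores[i]
    return best_recall, best_threshold
```

In practice the threshold would normally be selected on a validation split and then applied unchanged to the test set, so that the reported recall is not tuned on the data it is measured on.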

Full analysis with all 18 figures, drift experiments, and SHAP deep-dive: