End-to-end credit card fraud detection pipeline using the Kaggle Credit Card Fraud Dataset — 284,807 transactions, 492 fraudulent (0.17% fraud rate).
creditcard.csv → PySpark Preprocessing → Feature Engineering → Model Training → FastAPI Service → Docker
                                                                      │
                                                         ┌────────────┴────────────┐
                                                         │   Logistic Regression   │
                                                         │   XGBoost               │
                                                         │   PyTorch MLP           │
                                                         └─────────────────────────┘
Predictions are logged to PostgreSQL via SQLAlchemy. API request/response schemas are validated with Pydantic.
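As a rough illustration of the Pydantic validation step, the sketch below defines a request and a response schema. It assumes the API accepts the raw Kaggle columns (`Time`, `V1` to `V28`, `Amount`); the model names `TransactionRequest` and `PredictionResponse`, the response fields, and the sample values are hypothetical and not taken from this repository.

```python
from pydantic import BaseModel, Field, ValidationError, create_model

# Hypothetical request schema: one float per Kaggle column (Time, V1..V28, Amount).
feature_fields = {f"V{i}": (float, ...) for i in range(1, 29)}
TransactionRequest = create_model(
    "TransactionRequest",
    Time=(float, ...),
    Amount=(float, Field(..., ge=0)),  # assumption: amounts are never negative
    **feature_fields,
)


class PredictionResponse(BaseModel):
    """Hypothetical response schema returned by the scoring endpoint."""

    fraud_probability: float = Field(..., ge=0.0, le=1.0)
    is_fraud: bool


# Made-up payload, only to show validation behaviour.
payload = {"Time": 0.0, "Amount": 25.0, **{f"V{i}": 0.0 for i in range(1, 29)}}
request = TransactionRequest(**payload)

# A negative amount violates the ge=0 constraint and is rejected.
try:
    TransactionRequest(**{**payload, "Amount": -5.0})
    rejected = False
except ValidationError:
    rejected = True

response = PredictionResponse(fraud_probability=0.02, is_fraud=False)
```

Building the request model with `create_model` avoids writing out 28 near-identical field declarations; an explicit class with one attribute per column would work just as well.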
- Preprocessing: PySpark
- Modeling: scikit-learn, XGBoost, PyTorch
- Class imbalance: SMOTE + undersampling (imbalanced-learn)
- Serving: FastAPI + Uvicorn
- Storage: PostgreSQL + SQLAlchemy
- Evaluation: Precision, Recall, F1, AUC-ROC (accuracy is misleading at 0.17% fraud rate)
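To see why accuracy is misleading here, consider a baseline that labels every transaction as legitimate. The sketch below uses only the dataset counts quoted above (284,807 transactions, 492 fraudulent); the helper `confusion_metrics` is a hypothetical name introduced for illustration, and AUC-ROC is left out because it needs per-transaction scores rather than counts.

```python
TOTAL_TRANSACTIONS = 284_807
FRAUD_TRANSACTIONS = 492


def confusion_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Baseline: predict "not fraud" for everything, so every fraud is a false negative.
baseline = confusion_metrics(
    tp=0,
    fp=0,
    fn=FRAUD_TRANSACTIONS,
    tn=TOTAL_TRANSACTIONS - FRAUD_TRANSACTIONS,
)

for name, value in baseline.items():
    print(f"{name}: {value:.4f}")
```

The baseline reaches about 99.8% accuracy while catching no fraud at all (recall and F1 are both zero), which is why the pipeline reports precision, recall, F1, and AUC-ROC instead.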
# Download the dataset from Kaggle and place it at data/creditcard.csv
python -m venv venv
source venv/bin/activate
pip install -r files/requirements.txt
# EDA
python files/01_eda.py
# Modeling
python files/02_modeling.py
# Or run as notebooks
jupyter notebook

Features V14, V17, V12, and V10 are the strongest fraud discriminators based on distribution separation between fraud and non-fraud classes.
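One way to quantify distribution separation is the two-sample Kolmogorov-Smirnov statistic, the largest gap between the empirical CDFs of a feature in the two classes. This is an assumption about how separation could be measured, not necessarily the method used in the EDA script; `ks_statistic` is a hypothetical helper and the values below are toy numbers, not dataset values.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: maximum absolute gap between empirical CDFs."""
    a_sorted, b_sorted = sorted(sample_a), sorted(sample_b)
    largest_gap = 0.0
    # The maximum gap is always attained at one of the observed values.
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Toy values for a single feature, split by class (illustrative only).
legit_values = [0.1, -0.3, 0.4, 0.0, -0.1]
fraud_values = [-6.0, -7.5, -5.2, -8.1]

separated = ks_statistic(legit_values, fraud_values)  # classes do not overlap
identical = ks_statistic(legit_values, legit_values)  # same distribution
print(separated, identical)
```

A statistic near 1 means the two classes barely overlap on that feature, and a value near 0 means the feature does little to distinguish them. Ranking the V features by this score per class would give one possible ordering of discriminators.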