RyRy241/Distributed-Fraud-Detection-Pipeline

Distributed Fraud Detection Pipeline

End-to-end credit card fraud detection pipeline built on the Kaggle Credit Card Fraud Dataset: 284,807 transactions, of which 492 are fraudulent (a fraud rate of about 0.17%).

Architecture

creditcard.csv → PySpark Preprocessing → Feature Engineering → Model Training → FastAPI Service → Docker
                                                                      │
                                                         ┌────────────┴────────────┐
                                                         │  Logistic Regression     │
                                                         │  XGBoost                 │
                                                         │  PyTorch MLP             │
                                                         └─────────────────────────┘

Predictions are logged to PostgreSQL via SQLAlchemy. API request/response schemas are validated with Pydantic.
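The validate-then-log flow described above can be sketched with the standard library alone. This is not the repository's code: a plain dataclass stands in for the Pydantic schema and an in-memory SQLite database stands in for PostgreSQL + SQLAlchemy, so that the example runs without extra services. The names `PredictionRequest`, `log_prediction`, and the `predictions` table are hypothetical; the feature columns (Time, V1 to V28, Amount) are those of the Kaggle file.

```python
import sqlite3
from dataclasses import dataclass

# Feature columns of the Kaggle CSV; the "Class" column is the label.
FEATURE_NAMES = ["Time"] + [f"V{i}" for i in range(1, 29)] + ["Amount"]


@dataclass
class PredictionRequest:
    """Hypothetical request schema (stand-in for a Pydantic model)."""
    features: dict

    def __post_init__(self):
        missing = [n for n in FEATURE_NAMES if n not in self.features]
        if missing:
            raise ValueError(f"missing features: {missing}")
        # Coerce to float so malformed values fail here, not inside the model.
        self.features = {n: float(self.features[n]) for n in FEATURE_NAMES}


def log_prediction(conn, request, fraud_probability):
    """Append one prediction to a hypothetical `predictions` table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS predictions "
        "(amount REAL, fraud_probability REAL)"
    )
    conn.execute(
        "INSERT INTO predictions VALUES (?, ?)",
        (request.features["Amount"], fraud_probability),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # SQLite stand-in for PostgreSQL
request = PredictionRequest({**{n: 0.0 for n in FEATURE_NAMES}, "Amount": 42.5})
log_prediction(conn, request, fraud_probability=0.03)  # placeholder score
rows = conn.execute("SELECT amount, fraud_probability FROM predictions").fetchall()
print(rows)
```

Validating before the model runs means a malformed request is rejected before anything is scored or written to the log.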

Stack

  • Preprocessing: PySpark
  • Modeling: scikit-learn, XGBoost, PyTorch
  • Class imbalance: SMOTE + undersampling (imbalanced-learn)
  • Serving: FastAPI + Uvicorn
  • Storage: PostgreSQL + SQLAlchemy
  • Evaluation: Precision, Recall, F1, AUC-ROC (accuracy is misleading at a 0.17% fraud rate, because a model that never predicts fraud still scores about 99.83% accuracy)
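To make the note on accuracy concrete, the sketch below scores a trivial baseline that labels every transaction as legitimate, using only the class counts quoted in this README. The helper `binary_metrics` is hypothetical and written from the standard metric definitions; it is not taken from the repository.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


TOTAL = 284_807  # transactions in the dataset
FRAUD = 492      # fraudulent transactions

# Baseline that never flags fraud: every fraud case is a false negative.
always_legit = binary_metrics(tp=0, fp=0, fn=FRAUD, tn=TOTAL - FRAUD)
for name, value in always_legit.items():
    print(f"{name}: {value:.4f}")
```

The baseline reaches roughly 99.83% accuracy while catching no fraud at all (recall and F1 are both zero), which is why the pipeline reports precision, recall, F1, and AUC-ROC instead.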

Setup

# Download the dataset from Kaggle and place it at data/creditcard.csv

python -m venv venv
source venv/bin/activate
pip install -r files/requirements.txt

Usage

# EDA
python files/01_eda.py

# Modeling
python files/02_modeling.py

# Or run as notebooks
jupyter notebook

Key Findings (EDA)

Features V14, V17, V12, and V10 are the strongest fraud discriminators based on distribution separation between fraud and non-fraud classes.
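The README does not say how distribution separation was measured. One simple way to rank features by it, shown here purely as an assumption, is a standardized mean difference between the two classes. The feature names and values below are made-up toy data, not figures from the dataset.

```python
import math
from statistics import mean, variance


def separation_score(fraud_values, legit_values):
    """Absolute difference in class means, scaled by the pooled std dev."""
    pooled_var = (variance(fraud_values) + variance(legit_values)) / 2
    return abs(mean(fraud_values) - mean(legit_values)) / math.sqrt(pooled_var)


# Toy data (hypothetical): one feature whose fraud values are clearly
# shifted, and one whose fraud and non-fraud values overlap.
toy = {
    "shifted_feature": {
        "fraud": [-6.0, -7.5, -5.0, -8.0],
        "legit": [0.2, -0.3, 0.1, -0.1, 0.4],
    },
    "overlapping_feature": {
        "fraud": [0.5, -0.4, 0.1, 0.3],
        "legit": [0.2, -0.3, 0.1, -0.1, 0.4],
    },
}

# Rank features from most to least separated.
ranked = sorted(
    ((name, separation_score(v["fraud"], v["legit"])) for name, v in toy.items()),
    key=lambda item: item[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name}: {score:.2f}")
```

On the real data the same function would be applied per column after splitting rows on the Class label. A score like this only captures shifts in the mean, so it complements rather than replaces looking at the plotted distributions.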
