A production-grade, real-time fraud detection system built as a portfolio project across 5 phases. Each phase is independently deployable and builds on the previous, demonstrating the full lifecycle of a modern data engineering and ML platform.
This platform simulates a financial institution's fraud detection infrastructure — ingesting payment transactions at scale, transforming them through a lakehouse architecture, training and serving a machine learning model in real time, and exposing operational dashboards for monitoring.
The system handles the full data lifecycle:
- Streaming ingestion at configurable TPS (transactions per second)
- ACID-compliant lakehouse storage with Delta Lake
- Batch transformation through a medallion architecture (Bronze → Silver → Gold)
- ML model training, versioning, and real-time scoring
- Operational observability with metrics, alerts, and dashboards
| Layer | Technology | Role |
|---|---|---|
| Ingestion | Apache Kafka 7.7 | Distributed event streaming |
| Schema | Confluent Schema Registry | Avro schema enforcement + evolution |
| Storage | MinIO | S3-compatible object store (Delta Lake backend) |
| Table Format | Delta Lake | ACID transactions on Parquet files |
| Stream Processing | Apache Spark 3.5 Structured Streaming | Kafka → Delta Lake writer |
| SQL Engine | Trino 435 | Federated SQL over Delta Lake |
| Transformation | dbt (dbt-trino) | SQL-based Bronze → Silver → Gold transforms |
| Orchestration | Apache Airflow 2.8 | Pipeline scheduling and dependency management |
| ML Tracking | MLflow 2.10 | Experiment tracking + model registry |
| Feature Store | Redis 7 | Sub-millisecond real-time feature lookup |
| Serving | FastAPI + uvicorn | Real-time fraud scoring REST API |
| Metrics | Prometheus | Metrics collection and alerting |
| Dashboards | Grafana | Operational and ML monitoring dashboards |
| BI | Apache Superset | Self-serve business analytics |
| Infrastructure | Docker Compose | Full-stack local orchestration |
┌─────────────────────────────────────────────────────────────────────────┐
│ DATA FLOW │
│ │
│ [Python Producer] │
│ │ Avro-serialised transactions @ configurable TPS │
│ ▼ │
│ [Kafka: transactions.raw] ──bad messages──► [transactions.dlq] │
│ │ │
│ │ Spark Structured Streaming (micro-batch every 10s) │
│ ▼ │
│ [Quality Checks] ──failed──► [transactions.dlq] │
│ │ passed │
│ ▼ │
│ [Delta Lake Bronze] ── Parquet partitioned by event_date ──► [MinIO] │
│ │ │
│ │ Airflow DAG (every 15 min) → dbt │
│ ▼ │
│ [Silver: stg_transactions + int_transactions_enriched] │
│ │ type casts, merchant join, risk flags, rolling windows │
│ ▼ │
│ [Gold: mart_fraud_features, mart_hourly_metrics, mart_user_risk] │
│ │ │
│ ├──────────────────────────────────────┐ │
│ │ feature_writer.py (Airflow step) │ train.py (weekly) │
│ ▼ ▼ │
│ [Redis Feature Store] [MLflow Model Registry] │
│ │ user_features:{id} Production model │
│ │ │ │
│ └──────────────┬───────────────────────┘ │
│ ▼ │
│ [FastAPI /predict] │
│ │ fraud_probability, risk_tier, latency_ms │
│ ▼ │
│ [Predictions Log] ──► Postgres │
│ │ │
│ ▼ │
│ [Prometheus] ◄── /metrics ── [fraud-api, kafka-exporter] │
│ │ │
│ ▼ │
│ [Grafana] Pipeline Health + ML Monitoring dashboards │
│ [Superset] Business analytics over Gold tables via Trino │
└─────────────────────────────────────────────────────────────────────────┘
| Service | URL | Credentials |
|---|---|---|
| Kafka UI | http://localhost:8080 | — |
| Schema Registry | http://localhost:8081 | — |
| Airflow | http://localhost:8082 | admin / admin123 |
| Trino | http://localhost:8085 | any username, no password |
| Spark Master UI | http://localhost:8090 | — |
| MinIO Console | http://localhost:9001 | minioadmin / minioadmin123 |
| MLflow | http://localhost:5000 | — |
| Fraud API | http://localhost:8000 | — |
| Fraud API Docs | http://localhost:8000/docs | — |
| Prometheus | http://localhost:9090 | — |
| Grafana | http://localhost:3000 | admin / admin |
| Superset | http://localhost:8088 | admin / admin123 |
fraud-platform/
├── docker-compose.yml # Full stack, grows each phase
├── Makefile # make up / down / test / phase{N}-up
├── .env.example # All config — copy to .env
│
├── phase1-ingestion/ # Kafka + Avro producer
│ ├── producer/
│ │ ├── producer.py # Transaction generator
│ │ ├── schemas/transaction.avsc
│ │ └── Dockerfile
│ └── tests/
│
├── phase2-streaming/ # Spark Structured Streaming
│ ├── spark_jobs/
│ │ ├── bronze_writer.py # Kafka → Delta Bronze
│ │ └── quality_checks.py # Validation rules
│ └── tests/
│
├── phase3-transforms/ # dbt + Airflow + Trino
│ ├── dbt_project/
│ │ ├── models/
│ │ │ ├── staging/ # Bronze → typed views (Silver)
│ │ │ ├── intermediate/ # Enrichment + aggregation (Silver)
│ │ │ └── marts/ # ML features + metrics (Gold)
│ │ ├── seeds/merchants.csv # Reference data
│ │ ├── tests/ # Custom SQL quality tests
│ │ └── macros/ # generate_schema_name, rolling_window
│ └── airflow/
│ └── dags/fraud_pipeline.py # Orchestration DAG
│
├── phase4-ml/ # ML lifecycle
│ ├── features.py # Shared feature definitions (training + serving)
│ ├── training/
│ │ ├── train.py # RandomForest → MLflow
│ │ └── evaluate.py
│ ├── serving/
│ │ ├── main.py # FastAPI app
│ │ ├── predictor.py # MLflow model loader + scorer
│ │ ├── feature_client.py # Redis lookup + fallback
│ │ └── feature_writer.py # Gold → Redis (Airflow-triggered)
│ └── tests/
│
├── phase5-analytics/ # Observability + BI
│ ├── grafana/
│ │ ├── provisioning/ # Auto-loaded datasources + dashboards
│ │ └── dashboards/ # Pipeline health + ML monitoring JSON
│ └── superset/
│ └── superset_config.py
│
└── infra/
├── minio/init_buckets.sh
└── trino/
├── catalog/delta.properties
├── config.properties
└── jvm.config
- Docker Desktop or Docker Engine with Compose v2
- 16 GB RAM recommended (all phases running simultaneously)
- 20 GB free disk space
git clone <repo-url>
cd fraud-platform
cp .env.example .env
# Edit .env — at minimum generate a Fernet key for Airflow:
# python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"make phase1-up # Kafka, Schema Registry, Kafka UI, Producer
make phase2-up # MinIO, Spark, Delta Lake streaming
make phase3-up # Postgres, Trino, Airflow, dbt
make phase4-up # MLflow, Redis, Fraud API
make phase5-up # Prometheus, Grafana, SupersetOr start everything at once:
make upmake testGoal: Get transactions flowing through Kafka with schema enforcement.
What's built:
- Python producer generating realistic payment transactions at configurable TPS (default 100/s)
- Avro schema registered in Schema Registry before any messages are sent
- Dead Letter Queue (DLQ) — bad messages are never dropped, they're routed to
transactions.dlqfor inspection - 10 Kafka partitions enabling 10 parallel consumers
- Kafka UI for live topic monitoring
Key files:
phase1-ingestion/producer/schemas/transaction.avsc— the data contractphase1-ingestion/producer/producer.py— transaction generator with DLQ routing
What you learn:
- Kafka topics, partitions, consumer groups
- Why Avro beats JSON for streaming (schema enforcement, compact binary encoding)
- Schema evolution and backward compatibility
- Dead letter queue pattern for fault-tolerant ingestion
Validation:
# See messages flowing
open http://localhost:8080 # Kafka UI → transactions.raw
# Check schema registered
curl http://localhost:8081/subjectsGoal: Land every Kafka transaction into Delta Lake on MinIO with quality guarantees.
What's built:
- Spark Structured Streaming job reading
transactions.raw - Quality checks validate each row before landing (amount > 0, user_id not null, valid timestamp)
- Invalid rows routed to
transactions.dlq— never dropped silently - Delta Lake writes partitioned by
event_datefor efficient time-range queries - Checkpointing to MinIO — crash-safe, exactly-once delivery
Architecture decision — why Delta Lake:
- ACID transactions: partial writes never appear to readers
- Time travel: query any previous version of the table
- Schema evolution: add columns without breaking existing readers
- Unified batch + streaming reads
Key files:
phase2-streaming/spark_jobs/bronze_writer.py— main streaming jobphase2-streaming/spark_jobs/quality_checks.py— validation rules
What you learn:
- Micro-batch streaming vs continuous processing
- Watermarking for late-arriving events
- Checkpointing and Kafka offset management
- Medallion architecture — why raw Bronze data is preserved unmodified
- Backpressure via
maxOffsetsPerTrigger
Validation:
open http://localhost:9001 # MinIO → fraud-platform/bronze/transactions/
open http://localhost:8090 # Spark UI → active streaming queryGoal: Transform raw Bronze data into analytics-ready Silver and Gold layers. Schedule everything with Airflow.
What's built:
- Trino as the SQL query engine over Delta Lake files in MinIO
- dbt models across three layers:
- Staging (Silver): type casts, renames — no business logic
- Intermediate (Silver): merchant join, risk flags, 30-day rolling aggregations per user
- Marts (Gold): ML feature table, hourly metrics, user risk profiles
- Airflow DAG running every 15 minutes:
bronze_freshness_check → dbt_seed → dbt_staging → dbt_intermediate → dbt_marts → dbt_test → push_features_to_redis - Custom dbt tests: fraud rate sanity check, no-future-events guard
merchants.csvseed — reference data joined to every transaction
Medallion layer mapping:
| Layer | Trino Schema | What it contains |
|---|---|---|
| Bronze | delta.bronze |
Raw Spark-written Parquet |
| Silver | delta.silver |
Typed views: stg_transactions, int_enriched, int_velocity |
| Gold | delta.gold |
Incremental tables: mart_fraud_features, mart_hourly_metrics, mart_user_risk_profiles |
Key design: why incremental materialization
Mart models use incremental (INSERT INTO) rather than table (CREATE + rename) because MinIO/S3 does not support atomic directory rename. Incremental writes are safe and idempotent via unique_key.
Key files:
phase3-transforms/dbt_project/models/— all SQL transformsphase3-transforms/airflow/dags/fraud_pipeline.py— orchestration DAGinfra/trino/catalog/delta.properties— connects Trino to MinIO
What you learn:
- dbt materializations: view vs table vs incremental
- Incremental models:
is_incremental(),unique_key, schema change handling - Airflow DAG dependencies, retries, SLA monitoring
- Trino as a query federation layer — SQL over files without moving data
- Semantic layer: defining metrics as code, not in dashboards
Validation:
open http://localhost:8082 # Airflow → fraud_pipeline DAG running
docker compose exec trino trino --execute \
"SELECT COUNT(*) FROM delta.bronze_marts.mart_fraud_features"Goal: Train a fraud classifier, register it, and serve real-time predictions through a production API.
What's built:
- Training pipeline: RandomForest trained on
mart_fraud_features(30-day user behaviour features). Logs to MLflow. Auto-promotes to Staging if AUC ≥ 0.90 - MLflow model registry: None → Staging → Production → Archived lifecycle. Model artifacts stored in MinIO
- Redis feature store:
feature_writer.pypushes user feature vectors to Redis after every dbt Gold run. Keys expire after 24h - FastAPI
/predict: Looks up user features from Redis (< 1ms), scores with loaded model, returnsfraud_probability,risk_tier,model_version,latency_ms - Circuit breaker: Redis timeout or failure → fallback to safe defaults. API never goes down due to cache failure
- Prediction logging: Every prediction written to Postgres
predictionstable for audit and retraining
The train-serve skew problem — and the solution:
features.py is a single shared file imported by both train.py and feature_client.py. The same UserFeatures.to_array() method computes features identically in training and serving. This is enforced at the Docker build level — the Dockerfile copies the root features.py into the serving container.
Feature pipeline:
Gold: mart_fraud_features (Trino)
↓ feature_writer.py (every 15 min)
Redis: user_features:{user_id} = JSON blob
↓ feature_client.py (every /predict call)
UserFeatures.to_array() → numpy array → RandomForest → float
Key files:
phase4-ml/features.py— shared feature definitions (the contract)phase4-ml/training/train.py— MLflow-instrumented trainingphase4-ml/serving/main.py— FastAPI endpointsphase4-ml/serving/predictor.py— model loader + scorerphase4-ml/serving/feature_client.py— Redis lookup with fallbackphase4-ml/serving/feature_writer.py— Gold → Redis writer
What you learn:
- MLflow: experiment tracking, artifact storage, model registry stages
- Train-serve skew: why shared feature code is non-negotiable
- Circuit breaker pattern for cache-dependent services
- Redis as a feature store: memory-first, key-value, sub-millisecond reads
- FastAPI: async endpoints, Pydantic models, startup events
Validation:
open http://localhost:5000 # MLflow → fraud_classifier in Production
curl -X POST http://localhost:8000/predict \
-H "Content-Type: application/json" \
-d '{"transaction_id":"t1","user_id":"U1234","amount":500,
"merchant_id":"M001","is_international":false}'
# Kill Redis → API still responds (fallback)
docker compose stop redis
curl -X POST http://localhost:8000/predict ... # still works
docker compose start redisGoal: Full observability stack live. Dashboards for operations and business. System hardened for reliability.
What's built:
- Prometheus: Scrapes
fraud-api /metrics(prediction counters, latency histograms) andkafka-exporter(topic message rates, consumer lag, DLQ counts) - Grafana dashboards:
- Pipeline Health: Kafka messages/sec, consumer lag per partition, DLQ alert, Bronze table freshness
- ML Monitoring: Prediction volume, P50/P95/P99 latency, fraud rate trend, Redis cache hit rate, model version in production
- Superset: Connected to Trino, queries Gold tables directly. Three dashboards: Transaction Overview, Fraud Analytics, Real-Time Risk Feed
- kafka-exporter: Sidecar that exposes Kafka broker and consumer group metrics in Prometheus format
Instrumentation in fraud-api:
fraud_predictions_total{risk_tier} — counter, predictions by risk level
fraud_prediction_latency_seconds — histogram, P50/P95/P99 available
fraud_model_version — gauge, currently loaded model version
fraud_redis_cache_hits_total — counter, cache vs fallback ratio
Alerting philosophy — alert on symptoms, not causes:
- DLQ count > 0 in 5 min window → pipeline data quality issue
- API error rate > 1% → serving reliability issue
- Fraud rate > 2x 7-day baseline → model drift or attack pattern
What you learn:
- Prometheus pull model: scrape intervals, labels, metric types (counter, gauge, histogram)
- Grafana: datasource provisioning, dashboard-as-code, alert rules
- Four golden signals: latency, traffic, errors, saturation
- Model drift detection: when to alert vs when to retrain
- Superset semantic layer: define datasets once, analysts self-serve
Validation:
open http://localhost:9090/targets # Prometheus → all targets UP
open http://localhost:3000 # Grafana → both dashboards with data
open http://localhost:8088 # Superset → SQL Lab against Gold tables
# Trigger alert: flood DLQ
for i in {1..50}; do
docker compose exec kafka kafka-console-producer \
--bootstrap-server localhost:29092 --topic transactions.dlq <<< "bad-message-$i"
donemake up # Start all services
make down # Stop all services
make clean # Stop + remove all volumes (destructive)
make build # Rebuild all images
make phase1-up # Start Phase 1 services only
make phase2-up # Start Phase 2 services only
make phase3-up # Start Phase 3 services only
make phase4-up # Start Phase 4 services only
make phase5-up # Start Phase 5 services only
make test # Run all phase tests
make phase1-test # Phase 1 unit tests (inside Docker)
make phase2-test # Phase 2 Spark tests (inside Docker)
make phase3-test # dbt schema + custom tests
make phase4-test # ML feature + serving tests
make ml-train # Trigger a training run
make predict-test # Send one test prediction to fraud-api
make dbt-run # Run dbt seed + run manually
make dbt-test # Run dbt tests manually
make kafka-lag # Show consumer group lag
make redis-features # List Redis feature keysWhy Avro over JSON? Schema enforcement at write time, not read time. The Schema Registry rejects malformed messages before they enter the pipeline. Avro binary encoding is ~30% smaller than equivalent JSON.
Why Delta Lake over raw Parquet? ACID transactions mean partial writes never appear to readers. Time travel enables point-in-time recovery. Schema evolution is non-breaking. Spark and Trino both read Delta natively.
Why dbt over raw SQL scripts?
Dependency management, incremental materialisation, built-in testing, and a semantic layer. dbt test runs 30+ data quality assertions on every pipeline run. SQL is version-controlled and peer-reviewable.
Why Redis for the feature store? A Trino query over Parquet files takes 300–2000ms. Redis GET is < 1ms. At 500 predictions/sec, that's the difference between a usable API and a non-starter. Gold tables remain the source of truth; Redis is the speed layer.
Why share features.py between training and serving? Train-serve skew is one of the most common sources of ML model degradation in production. If the feature engineering code diverges between training and serving, the model sees different inputs at inference time than it was trained on. Sharing one file makes this divergence impossible at the code level.
- Streaming: Kafka partitioning, consumer groups, Schema Registry, DLQ patterns
- Lakehouse: Delta Lake ACID writes, checkpointing, watermarking, medallion architecture
- Data Modelling: dbt staging/intermediate/mart separation, incremental models, semantic layer
- Orchestration: Airflow DAG design, task dependencies, retry policies
- ML Engineering: MLflow tracking, model registry, feature stores, prediction logging
- API Design: FastAPI, circuit breakers, fallback patterns, Prometheus instrumentation
- Observability: Four golden signals, alert-on-symptoms philosophy, dashboard-as-code
- Infrastructure: Docker Compose multi-network, health checks, volume management