My favorite moment in any data project is the one where a "clean" result turns out not to be so clean after all. I'm a Data Scientist building end-to-end, production-quality machine learning pipelines on real-world data — and in every project, I spend real time getting to know the data before I write a single line of modeling code.
I'm actively building a portfolio around real, messy, far-from-ideal datasets. The goal isn't just to train a model — it's to surface and fix the real bugs that show up during data collection, cleaning, and validation. Every repo follows the same principles:
- End-to-end, production-grade pipelines with no placeholders
- A validation-first methodology — profiling the data before touching the model
- Honest, transparent reporting of results (including data limitations)

Featured projects:

- LAPD Crime Analysis — Spatiotemporal crime hotspot detection using KDE and Getis-Ord Gi*, paired with a Fairlearn-based fairness audit of a LightGBM case-clearance model. The model looked "fair" mainly because it was uniformly pessimistic across every demographic group — that was the real finding.
- Semantic Search / RAG Engine (from scratch) — A RAG system built without LangChain or LlamaIndex: BM25, dense, and hybrid retrieval (via Qdrant), with a full evaluation suite (Recall@k, MRR, nDCG).
- Statistical Detection of Global Warming — Testing the statistical significance of warming trends with the Mann-Kendall test and AR(1)-corrected OLS, and estimating trend magnitude with the Theil-Sen slope estimator.
- METABRIC Survival — XGBoost + SHAP pipeline for 5-year breast cancer survival prediction, handling class imbalance in the clinical data with ADASYN oversampling.
- Explainable Airline Sentiment — An NLP pipeline that not only predicts sentiment but also explains each prediction, using explainability (XAI) methods to surface the reasoning behind it.
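
The fairness finding from the LAPD Crime Analysis project can be illustrated with a toy example. This is a minimal sketch, not the project's actual code: the data, group labels, and helper functions are all hypothetical, and it uses plain Python rather than Fairlearn. It shows how a model that predicts "not cleared" for everyone gets a demographic parity gap of zero while its recall is also zero in every group.

```python
from collections import defaultdict


def selection_rates(y_pred, groups):
    """Fraction of positive predictions within each group."""
    totals, positives = defaultdict(int), defaultdict(int)
    for pred, group in zip(y_pred, groups):
        totals[group] += 1
        positives[group] += pred
    return {g: positives[g] / totals[g] for g in totals}


def demographic_parity_gap(y_pred, groups):
    """Largest minus smallest per-group selection rate (0.0 means parity)."""
    rates = selection_rates(y_pred, groups)
    return max(rates.values()) - min(rates.values())


def recall_by_group(y_true, y_pred, groups):
    """True positive rate per group; None if a group has no true positives."""
    hits, actual = defaultdict(int), defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 1:
            actual[group] += 1
            hits[group] += pred
    return {g: (hits[g] / actual[g] if actual[g] else None) for g in set(groups)}


# Hypothetical toy data: 1 = case cleared, 0 = not cleared.
groups = ["A"] * 6 + ["B"] * 6
y_true = [1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0]

# A uniformly pessimistic model: it never predicts a clearance.
y_pred = [0] * len(y_true)

dp_gap = demographic_parity_gap(y_pred, groups)
recalls = recall_by_group(y_true, y_pred, groups)

print(f"demographic parity gap: {dp_gap}")
print(f"recall by group: {recalls}")
```

A parity metric alone would report this model as fair, which is why per-group performance is worth checking alongside it.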

Tech stack: Python · scikit-learn · XGBoost / LightGBM · SHAP · Fairlearn · Qdrant · LangChain / LangGraph · pandas · PySAL · statsmodels · TensorFlow

I share completed projects on LinkedIn, explaining the technical details in plain language.
⭐️ If a repo catches your eye, feel free to dig in — every one of them is built on real data, real bugs, and real fixes.