A full-stack experimentation platform for designing, running, and analyzing controlled experiments. Covers the full lifecycle: power analysis before data collection, sequential monitoring during, and statistical inference after. Built from scratch with a custom statistical engine rather than relying on black-box libraries.
┌─────────────────────────────────────────────────────────┐
│ React Frontend │
│ Dashboard │ Experiment Wizard │ Results & Visualizations│
└─────────────────────┬───────────────────────────────────┘
│ REST (Axios)
┌─────────────────────▼───────────────────────────────────┐
│ FastAPI Backend │
│ │
│ /experiments /stats/run /stats/power /upload │
│ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Statistical Engine │ │
│ │ │ │
│ │ hypothesis_tests.py power_analysis.py │ │
│ │ sequential.py │ │
│ └─────────────────────────────────────────────────┘ │
└─────────────────────┬───────────────────────────────────┘
│ SQLAlchemy ORM
┌─────────────────────▼───────────────────────────────────┐
│ PostgreSQL 15 │
│ experiments table (JSON results column) │
└─────────────────────────────────────────────────────────┘
| Layer | Technology |
|---|---|
| Frontend | React 18, Vite, React Router v6, Recharts, Axios |
| Backend | Python 3.11, FastAPI 0.111, Uvicorn |
| Database | PostgreSQL 15, SQLAlchemy 2.0, psycopg2 |
| Statistics | SciPy 1.14, statsmodels 0.14, NumPy 2.1, pandas 2.2 |
| Infrastructure | Docker, Docker Compose |
Pre-experiment sample size calculation. Given a baseline rate, minimum detectable effect (MDE), significance level α, and desired power 1-β, the required sample size per variant is:

$$n = \frac{2\,(z_{1-\alpha/2} + z_{1-\beta})^2}{\delta^2}$$

Where δ is the standardized effect size. For proportions, Cohen's h:

$$h = 2\arcsin\sqrt{p_1} - 2\arcsin\sqrt{p_2}$$

For continuous metrics, Cohen's d:

$$d = \frac{\mu_1 - \mu_2}{\sigma_{\text{pooled}}}, \qquad \sigma_{\text{pooled}} = \sqrt{\frac{\sigma_1^2 + \sigma_2^2}{2}}$$
The platform generates a full power curve across a range of MDE values, surfacing the tradeoff between effect sensitivity and data requirements.
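The calculation can be sketched directly with SciPy. This is a minimal illustration of the formula above, not the engine's actual API; function names and the example baseline are assumptions:

```python
import numpy as np
from scipy.stats import norm

def sample_size_per_variant(delta: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """n per variant for a two-sided test at standardized effect size delta."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return int(np.ceil(2 * (z_alpha + z_beta) ** 2 / delta ** 2))

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference of arcsine-transformed proportions."""
    return 2 * np.arcsin(np.sqrt(p1)) - 2 * np.arcsin(np.sqrt(p2))

# Power curve: required n for a range of MDEs over a 12% baseline
baseline = 0.12
for mde in (0.01, 0.02, 0.03):
    h = abs(cohens_h(baseline + mde, baseline))
    print(f"MDE {mde:.2f} -> n per variant {sample_size_per_variant(h)}")
```

At a medium effect (δ = 0.2) this yields n = 393 per variant, matching the usual normal-approximation tables.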
Two-proportion z-test for binary outcomes (conversions, clicks). Under H₀: p₁ = p₂, the test statistic uses a pooled estimate of the common proportion:

$$z = \frac{\hat p_1 - \hat p_2}{\sqrt{\hat p(1-\hat p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}, \qquad \hat p = \frac{x_1 + x_2}{n_1 + n_2}$$

The confidence interval uses the unpooled SE, because post-test we are estimating the actual difference between rates, not the null hypothesis value:

$$(\hat p_1 - \hat p_2) \pm z_{1-\alpha/2}\sqrt{\frac{\hat p_1(1-\hat p_1)}{n_1} + \frac{\hat p_2(1-\hat p_2)}{n_2}}$$
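A minimal sketch of the test as described; the function name and return shape are illustrative, not the engine's real API:

```python
import numpy as np
from scipy.stats import norm

def two_proportion_ztest(x1: int, n1: int, x2: int, n2: int, alpha: float = 0.05):
    """Pooled z statistic for H0: p1 == p2, with an unpooled confidence interval."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se_pooled = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se_pooled
    p_value = 2 * norm.sf(abs(z))
    # Unpooled SE: post-test we estimate the actual difference, not the null value
    se_unpooled = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    zcrit = norm.ppf(1 - alpha / 2)
    ci = (p1 - p2 - zcrit * se_unpooled, p1 - p2 + zcrit * se_unpooled)
    return z, p_value, ci
```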
Welch's t-test for numeric metrics. It explicitly does not assume σ₁² = σ₂², an assumption that almost never holds in practice. The test statistic:

$$t = \frac{\bar x_1 - \bar x_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$

Degrees of freedom via the Welch-Satterthwaite approximation:

$$\nu = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}$$

Using Student's t-test when variances differ produces anti-conservative p-values; Welch is strictly more correct here.
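With SciPy the whole test is one call: `equal_var=False` selects Welch's variant with Welch-Satterthwaite degrees of freedom. The synthetic data below is an assumption for illustration only:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
control = rng.normal(48.0, 12.0, size=500)    # e.g. revenue per user
treatment = rng.normal(54.5, 20.0, size=500)  # different variance, as is typical

# equal_var=False => Welch's t-test rather than Student's
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
print(t_stat, p_value)
```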
Two-stage procedure to control Type I error inflation from multiple comparisons.
Stage 1 — Omnibus test
For proportions: Pearson chi-squared test on the full contingency table.
For continuous: One-way ANOVA via F-statistic:

$$F = \frac{\text{MS}_{\text{between}}}{\text{MS}_{\text{within}}} = \frac{\sum_i n_i(\bar x_i - \bar x)^2 \,/\, (k-1)}{\sum_i \sum_j (x_{ij} - \bar x_i)^2 \,/\, (N-k)}$$
Stage 2 — Pairwise comparisons with correction
Running k pairwise tests at level α each yields a family-wise error rate (FWER) of:

$$\text{FWER} = 1 - (1 - \alpha)^k$$

At α=0.05 with 3 comparisons, FWER ≈ 14.3%. Holm-Bonferroni correction controls FWER at α without being as conservative as full Bonferroni. Procedure: sort p-values p₁ ≤ p₂ ≤ ... ≤ pₖ and reject H₀ᵢ if:

$$p_i \le \frac{\alpha}{k - i + 1}$$

stepping down until the first failure, after which all remaining hypotheses are retained.
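The two-stage procedure for proportions can be sketched as follows. The control-vs-treatment pairwise structure and the function name are illustrative assumptions, not the engine's actual code:

```python
import numpy as np
from scipy import stats

def two_stage_proportions(successes, totals, alpha=0.05):
    """Stage 1: omnibus chi-squared on the full contingency table.
    Stage 2 (only if stage 1 rejects): pairwise z-tests of each variant
    against variant 0 (control), with Holm-Bonferroni correction."""
    table = np.array([successes, np.array(totals) - np.array(successes)])
    chi2, p_omnibus, _, _ = stats.chi2_contingency(table)
    if p_omnibus > alpha:
        return p_omnibus, []          # omnibus failed: no pairwise follow-up

    p_values = []
    for i in range(1, len(successes)):
        x = np.array([successes[0], successes[i]])
        n = np.array([totals[0], totals[i]])
        p_pool = x.sum() / n.sum()
        z = (x[0]/n[0] - x[1]/n[1]) / np.sqrt(p_pool * (1 - p_pool) * (1/n).sum())
        p_values.append(2 * stats.norm.sf(abs(z)))

    # Holm step-down: sorted p-values vs alpha/(k - i + 1); stop at first failure
    k = len(p_values)
    reject = [False] * k
    for rank, idx in enumerate(np.argsort(p_values)):
        if p_values[idx] <= alpha / (k - rank):
            reject[idx] = True
        else:
            break
    return p_omnibus, reject
```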
Standard fixed-sample tests are statistically invalid if you peek at results mid-experiment and stop early: each additional look inflates the false positive rate. The platform's answer is Wald's Sequential Probability Ratio Test (SPRT, 1945).
At each observation, compute the log likelihood ratio between H₁ (effect exists) and H₀ (no effect). For Bernoulli observations with rates p₁ under H₁ and p₀ under H₀:

$$\Lambda_n = \sum_{i=1}^{n} \left[\, x_i \log\frac{p_1}{p_0} + (1 - x_i)\log\frac{1 - p_1}{1 - p_0} \,\right]$$

Define two boundaries from the error rates α (Type I) and β (Type II):

$$A = \log\frac{1-\beta}{\alpha}, \qquad B = \log\frac{\beta}{1-\alpha}$$
At each step:
- If Λₙ ≥ A → reject H₀, declare treatment the winner
- If Λₙ ≤ B → accept H₀, declare no effect
- If B < Λₙ < A → continue collecting data
The platform tracks the full LLR trajectory and plots it against both boundaries in real time. Wald proved this procedure controls both error rates simultaneously while minimizing expected sample size under both hypotheses.
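A minimal SPRT sketch under these definitions; the function name and return convention are illustrative, not the engine's actual interface:

```python
import numpy as np

def sprt_bernoulli(observations, p0, p1, alpha=0.05, beta=0.2):
    """Wald's SPRT on a stream of 0/1 outcomes.
    Returns the LLR trajectory up to the stopping point and a decision."""
    A = np.log((1 - beta) / alpha)   # upper boundary -> reject H0
    B = np.log(beta / (1 - alpha))   # lower boundary -> accept H0
    llr, trajectory = 0.0, []
    for x in observations:
        llr += x * np.log(p1 / p0) + (1 - x) * np.log((1 - p1) / (1 - p0))
        trajectory.append(llr)
        if llr >= A:
            return trajectory, "reject H0"
        if llr <= B:
            return trajectory, "accept H0"
    return trajectory, "continue"    # boundaries not crossed: keep collecting
```

With p₀ = 0.1 vs p₁ = 0.3 and default error rates, a run of successes crosses the upper boundary after only three observations, illustrating the early-stopping benefit.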
ab-test-platform/
├── backend/
│ ├── main.py # FastAPI app, CORS middleware, lifespan
│ ├── db.py # SQLAlchemy ORM models, session factory
│ ├── engine/
│ │ ├── __init__.py
│ │ ├── hypothesis_tests.py # z-test, Welch t-test, ANOVA, Holm-Bonferroni
│ │ ├── power_analysis.py # sample size calculator, MDE power curves
│ │ └── sequential.py # SPRT with full LLR trajectory
│ ├── routers/
│ │ ├── experiments.py # CRUD, CSV ingestion, JSON column storage
│ │ ├── stats.py # test dispatch, metric type inference
│ │ └── upload.py # CSV preview, column type inference, validation
│ └── requirements.txt
├── frontend/
│ └── src/
│ ├── pages/
│ │ ├── Dashboard.jsx # experiment list, run/delete actions
│ │ ├── NewExperiment.jsx # 3-step wizard, auto column detection
│ │ └── Results.jsx # stat cards, uplift chart, CI, SPRT plot, power curve
│ ├── api.js # axios base client, all endpoint calls
│ └── App.jsx # router, sidebar layout
├── data/
│ └── sample_experiment.csv # 5000 rows, 3 variants, baked lift signals
├── generate_sample_data.py
└── docker-compose.yml
Prerequisites: Python 3.11, Node.js 18+, PostgreSQL 15
Backend

```bash
cd backend
pip install -r requirements.txt
uvicorn main:app --reload --port 8000
```

Frontend

```bash
cd frontend
npm install
npm run dev
```

With Docker

```bash
docker-compose up
```

App at http://localhost:5173. API docs at http://localhost:8000/docs.
Synthetic experiment dataset with 5,000 observations across 3 variants:
```bash
python generate_sample_data.py
```

| Variant | Conversion | Avg Revenue | Avg Session |
|---|---|---|---|
| control | 12.0% | $48.00 | 3.8 min |
| treatment_a | 15.0% | $54.50 | 4.1 min |
| treatment_b | 13.0% | $50.20 | 3.9 min |
| Method | Endpoint | Description |
|---|---|---|
| GET | /experiments | List experiments ordered by created_at desc |
| POST | /experiments | Create experiment with CSV upload |
| GET | /experiments/{id} | Fetch experiment with full results blob |
| DELETE | /experiments/{id} | Hard delete |
| POST | /stats/run/{id} | Dispatch test, auto-detect metric type, persist results |
| GET | /stats/results/{id} | Fetch stored results |
| POST | /stats/power | Standalone power analysis, returns n and power curve |
| POST | /upload/preview | Parse CSV, infer column types, 5-row preview |
| POST | /upload/validate | Validate config, return structured errors and warnings |
Metric type auto-detection: The stats router checks whether every unique metric value lies in {0, 1, 0.0, 1.0}. If so, the binary test is dispatched; otherwise the continuous one. This reduces misconfiguration without requiring explicit user input.
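The detection rule fits in a couple of lines (illustrative, not the router's actual code; note that in Python `0 == 0.0`, so the set check covers both int and float encodings):

```python
def infer_metric_type(values) -> str:
    """Binary if every distinct value is 0 or 1; otherwise continuous."""
    return "binary" if set(values) <= {0, 1, 0.0, 1.0} else "continuous"
```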
Post-hoc power analysis: Runs after the test using the control group's observed mean and std from the actual data, not just pre-specified parameters. Tells you whether the experiment was adequately powered for the effect size that was actually present.
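A sketch of the post-hoc calculation under a normal approximation; the function and the example numbers are illustrative assumptions, not the platform's API:

```python
import numpy as np
from scipy.stats import norm

def achieved_power(mean_c, std_c, mean_t, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z-test for the effect actually
    observed, using the control group's observed mean and std."""
    d = abs(mean_t - mean_c) / std_c        # observed standardized effect
    se_mult = np.sqrt(2 / n_per_group)      # SE of the difference, in std units
    z_alpha = norm.ppf(1 - alpha / 2)
    return float(norm.sf(z_alpha - d / se_mult))

# e.g. control revenue $48.00 (std $20), treatment $54.50, 1,650 users per arm
print(achieved_power(48.0, 20.0, 54.5, 1650))
```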
JSON column for experiment data: Uploaded rows stored in a PostgreSQL JSON column, capped at 5,000 rows. Keeps the architecture self-contained without a separate object store. Tradeoff is no SQL-level querying inside the data, acceptable for this use case since all computation runs in-process.
UUID primary keys: Avoids leaking row counts through sequential integer IDs.