A full-stack experimentation platform for designing, running, and analyzing controlled experiments. Covers the full lifecycle: power analysis before data collection, sequential monitoring during, and statistical inference after. Built from scratch with a custom statistical engine rather than relying on black-box libraries.
┌─────────────────────────────────────────────────────────┐
│ React Frontend │
│ Dashboard │ Experiment Wizard │ Results & Visualizations│
└─────────────────────┬───────────────────────────────────┘
│ REST (Axios)
┌─────────────────────▼───────────────────────────────────┐
│ FastAPI Backend │
│ │
│ /experiments /stats/run /stats/power /upload │
│ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Statistical Engine │ │
│ │ │ │
│ │ hypothesis_tests.py power_analysis.py │ │
│ │ sequential.py │ │
│ └─────────────────────────────────────────────────┘ │
└─────────────────────┬───────────────────────────────────┘
│ SQLAlchemy ORM
┌─────────────────────▼───────────────────────────────────┐
│ PostgreSQL 15 │
│ experiments table (JSON results column) │
└─────────────────────────────────────────────────────────┘
| Layer | Technology |
|---|---|
| Frontend | React 18, Vite, React Router v6, Recharts, Axios |
| Backend | Python 3.11, FastAPI 0.111, Uvicorn |
| Database | PostgreSQL 15, SQLAlchemy 2.0, psycopg2 |
| Statistics | SciPy 1.14, statsmodels 0.14, NumPy 2.1, pandas 2.2 |
| Infrastructure | Docker, Docker Compose |
Pre-experiment sample size calculation. Given a baseline rate, minimum detectable effect (MDE), significance level α, and desired power 1-β, the required sample size per variant is:

$$n = \frac{2\,(z_{1-\alpha/2} + z_{1-\beta})^2}{\delta^2}$$

Where δ is the standardized effect size. For proportions, Cohen's h:

$$h = 2\arcsin\sqrt{p_1} - 2\arcsin\sqrt{p_2}$$

For continuous metrics, Cohen's d:

$$d = \frac{\mu_1 - \mu_2}{\sigma_{\text{pooled}}}, \qquad \sigma_{\text{pooled}} = \sqrt{\frac{\sigma_1^2 + \sigma_2^2}{2}}$$
The platform generates a full power curve across a range of MDE values, surfacing the tradeoff between effect sensitivity and data requirements.
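The calculation can be sketched directly with SciPy. This is a minimal illustration of the formula above, not the engine's actual API; function names and the example baseline are assumptions:

```python
import numpy as np
from scipy.stats import norm

def sample_size_per_variant(delta: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """n per variant for a two-sided test at standardized effect size delta."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return int(np.ceil(2 * (z_alpha + z_beta) ** 2 / delta ** 2))

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference of arcsine-transformed proportions."""
    return 2 * np.arcsin(np.sqrt(p1)) - 2 * np.arcsin(np.sqrt(p2))

# Power curve: required n for a range of MDEs over a 12% baseline
baseline = 0.12
for mde in (0.01, 0.02, 0.03):
    h = abs(cohens_h(baseline + mde, baseline))
    print(f"MDE {mde:.2f} -> n per variant {sample_size_per_variant(h)}")
```

At a medium effect (δ = 0.2) this yields n = 393 per variant, matching the usual normal-approximation tables.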
Two-proportion z-test for binary outcomes (conversions, clicks). Under H₀: p₁ = p₂, the test statistic uses a pooled estimate of the common proportion:

$$z = \frac{\hat p_1 - \hat p_2}{\sqrt{\hat p(1-\hat p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}, \qquad \hat p = \frac{x_1 + x_2}{n_1 + n_2}$$

The confidence interval uses the unpooled SE, because post-test we are estimating the actual difference between rates, not the null hypothesis value:

$$(\hat p_1 - \hat p_2) \pm z_{1-\alpha/2}\sqrt{\frac{\hat p_1(1-\hat p_1)}{n_1} + \frac{\hat p_2(1-\hat p_2)}{n_2}}$$
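A minimal sketch of the test as described; the function name and return shape are illustrative, not the engine's real API:

```python
import numpy as np
from scipy.stats import norm

def two_proportion_ztest(x1: int, n1: int, x2: int, n2: int, alpha: float = 0.05):
    """Pooled z statistic for H0: p1 == p2, with an unpooled confidence interval."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se_pooled = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se_pooled
    p_value = 2 * norm.sf(abs(z))
    # Unpooled SE: post-test we estimate the actual difference, not the null value
    se_unpooled = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    zcrit = norm.ppf(1 - alpha / 2)
    ci = (p1 - p2 - zcrit * se_unpooled, p1 - p2 + zcrit * se_unpooled)
    return z, p_value, ci
```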
Welch's t-test for numeric metrics. It explicitly does not assume σ₁² = σ₂², an assumption that almost never holds in practice. The test statistic:

$$t = \frac{\bar x_1 - \bar x_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$

Degrees of freedom via the Welch-Satterthwaite approximation:

$$\nu = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}$$

Using Student's t-test when variances differ produces anti-conservative p-values; Welch is strictly more correct here.
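With SciPy the whole test is one call: `equal_var=False` selects Welch's variant with Welch-Satterthwaite degrees of freedom. The synthetic data below is an assumption for illustration only:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
control = rng.normal(48.0, 12.0, size=500)    # e.g. revenue per user
treatment = rng.normal(54.5, 20.0, size=500)  # different variance, as is typical

# equal_var=False => Welch's t-test rather than Student's
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
print(t_stat, p_value)
```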
Two-stage procedure to control Type I error inflation from multiple comparisons.
Stage 1 — Omnibus test
For proportions: Pearson chi-squared test on the full contingency table.
For continuous: One-way ANOVA via F-statistic:

$$F = \frac{\text{MS}_{\text{between}}}{\text{MS}_{\text{within}}} = \frac{\sum_i n_i(\bar x_i - \bar x)^2 \,/\, (k-1)}{\sum_i \sum_j (x_{ij} - \bar x_i)^2 \,/\, (N-k)}$$
Stage 2 — Pairwise comparisons with correction
Running k pairwise tests at level α each yields a family-wise error rate (FWER) of:

$$\text{FWER} = 1 - (1 - \alpha)^k$$

At α=0.05 with 3 comparisons, FWER ≈ 14.3%. Holm-Bonferroni correction controls FWER at α without being as conservative as full Bonferroni. Procedure: sort p-values p₁ ≤ p₂ ≤ ... ≤ pₖ and reject H₀ᵢ if:

$$p_i \le \frac{\alpha}{k - i + 1}$$

stepping down until the first failure, after which all remaining hypotheses are retained.
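The two-stage procedure for proportions can be sketched as follows. The control-vs-treatment pairwise structure and the function name are illustrative assumptions, not the engine's actual code:

```python
import numpy as np
from scipy import stats

def two_stage_proportions(successes, totals, alpha=0.05):
    """Stage 1: omnibus chi-squared on the full contingency table.
    Stage 2 (only if stage 1 rejects): pairwise z-tests of each variant
    against variant 0 (control), with Holm-Bonferroni correction."""
    table = np.array([successes, np.array(totals) - np.array(successes)])
    chi2, p_omnibus, _, _ = stats.chi2_contingency(table)
    if p_omnibus > alpha:
        return p_omnibus, []          # omnibus failed: no pairwise follow-up

    p_values = []
    for i in range(1, len(successes)):
        x = np.array([successes[0], successes[i]])
        n = np.array([totals[0], totals[i]])
        p_pool = x.sum() / n.sum()
        z = (x[0]/n[0] - x[1]/n[1]) / np.sqrt(p_pool * (1 - p_pool) * (1/n).sum())
        p_values.append(2 * stats.norm.sf(abs(z)))

    # Holm step-down: sorted p-values vs alpha/(k - i + 1); stop at first failure
    k = len(p_values)
    reject = [False] * k
    for rank, idx in enumerate(np.argsort(p_values)):
        if p_values[idx] <= alpha / (k - rank):
            reject[idx] = True
        else:
            break
    return p_omnibus, reject
```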
Standard fixed-sample tests are statistically invalid if you peek at results mid-experiment and stop early: each additional look inflates the false positive rate. The platform's answer is Wald's Sequential Probability Ratio Test (SPRT, 1945).
At each observation, compute the log likelihood ratio between H₁ (effect exists) and H₀ (no effect). For Bernoulli observations with rates p₁ under H₁ and p₀ under H₀:

$$\Lambda_n = \sum_{i=1}^{n} \left[\, x_i \log\frac{p_1}{p_0} + (1 - x_i)\log\frac{1 - p_1}{1 - p_0} \,\right]$$

Define two boundaries from the error rates α (Type I) and β (Type II):

$$A = \log\frac{1-\beta}{\alpha}, \qquad B = \log\frac{\beta}{1-\alpha}$$
At each step:
- If Λₙ ≥ A → reject H₀, declare treatment the winner
- If Λₙ ≤ B → accept H₀, declare no effect
- If B < Λₙ < A → continue collecting data
The platform tracks the full LLR trajectory and plots it against both boundaries in real time. Wald proved this procedure controls both error rates simultaneously while minimizing expected sample size under both hypotheses.
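A minimal SPRT sketch under these definitions; the function name and return convention are illustrative, not the engine's actual interface:

```python
import numpy as np

def sprt_bernoulli(observations, p0, p1, alpha=0.05, beta=0.2):
    """Wald's SPRT on a stream of 0/1 outcomes.
    Returns the LLR trajectory up to the stopping point and a decision."""
    A = np.log((1 - beta) / alpha)   # upper boundary -> reject H0
    B = np.log(beta / (1 - alpha))   # lower boundary -> accept H0
    llr, trajectory = 0.0, []
    for x in observations:
        llr += x * np.log(p1 / p0) + (1 - x) * np.log((1 - p1) / (1 - p0))
        trajectory.append(llr)
        if llr >= A:
            return trajectory, "reject H0"
        if llr <= B:
            return trajectory, "accept H0"
    return trajectory, "continue"    # boundaries not crossed: keep collecting
```

With p₀ = 0.1 vs p₁ = 0.3 and default error rates, a run of successes crosses the upper boundary after only three observations, illustrating the early-stopping benefit.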
ab-test-platform/
├── backend/
│ ├── main.py # FastAPI app, CORS middleware, lifespan
│ ├── db.py # SQLAlchemy ORM models, session factory
│ ├── engine/
│ │ ├── __init__.py
│ │ ├── hypothesis_tests.py # z-test, Welch t-test, ANOVA, Holm-Bonferroni
│ │ ├── power_analysis.py # sample size calculator, MDE power curves
│ │ └── sequential.py # SPRT with full LLR trajectory
│ ├── routers/
│ │ ├── experiments.py # CRUD, CSV ingestion, JSON column storage
│ │ ├── stats.py # test dispatch, metric type inference
│ │ └── upload.py # CSV preview, column type inference, validation
│ └── requirements.txt
├── frontend/
│ └── src/
│ ├── pages/
│ │ ├── Dashboard.jsx # experiment list, run/delete actions
│ │ ├── NewExperiment.jsx # 3-step wizard, auto column detection
│ │ └── Results.jsx # stat cards, uplift chart, CI, SPRT plot, power curve
│ ├── api.js # axios base client, all endpoint calls
│ └── App.jsx # router, sidebar layout
├── data/
│ └── sample_experiment.csv # 5000 rows, 3 variants, baked lift signals
├── generate_sample_data.py
└── docker-compose.yml
Prerequisites: Python 3.11, Node.js 18+, PostgreSQL 15
Backend

```bash
cd backend
pip install -r requirements.txt
uvicorn main:app --reload --port 8000
```

Frontend

```bash
cd frontend
npm install
npm run dev
```

With Docker

```bash
docker-compose up
```

App at http://localhost:5173. API docs at http://localhost:8000/docs.
Synthetic experiment dataset with 5,000 observations across 3 variants:
```bash
python generate_sample_data.py
```

| Variant | Conversion | Avg Revenue | Avg Session |
|---|---|---|---|
| control | 12.0% | $48.00 | 3.8 min |
| treatment_a | 15.0% | $54.50 | 4.1 min |
| treatment_b | 13.0% | $50.20 | 3.9 min |
| Method | Endpoint | Description |
|---|---|---|
| GET | /experiments | List experiments ordered by created_at desc |
| POST | /experiments | Create experiment with CSV upload |
| GET | /experiments/{id} | Fetch experiment with full results blob |
| DELETE | /experiments/{id} | Hard delete |
| POST | /stats/run/{id} | Dispatch test, auto-detect metric type, persist results |
| GET | /stats/results/{id} | Fetch stored results |
| POST | /stats/power | Standalone power analysis, returns n and power curve |
| POST | /upload/preview | Parse CSV, infer column types, 5-row preview |
| POST | /upload/validate | Validate config, return structured errors and warnings |
Metric type auto-detection: The stats router checks whether every unique metric value lies in {0, 1, 0.0, 1.0}. If so, the binary test is dispatched; otherwise the continuous one. This reduces misconfiguration without requiring explicit user input.
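The detection rule fits in a couple of lines (illustrative, not the router's actual code; note that in Python `0 == 0.0`, so the set check covers both int and float encodings):

```python
def infer_metric_type(values) -> str:
    """Binary if every distinct value is 0 or 1; otherwise continuous."""
    return "binary" if set(values) <= {0, 1, 0.0, 1.0} else "continuous"
```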
Post-hoc power analysis: Runs after the test using the control group's observed mean and std from the actual data, not just pre-specified parameters. Tells you whether the experiment was adequately powered for the effect size that was actually present.
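A sketch of the post-hoc calculation under a normal approximation; the function and the example numbers are illustrative assumptions, not the platform's API:

```python
import numpy as np
from scipy.stats import norm

def achieved_power(mean_c, std_c, mean_t, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z-test for the effect actually
    observed, using the control group's observed mean and std."""
    d = abs(mean_t - mean_c) / std_c        # observed standardized effect
    se_mult = np.sqrt(2 / n_per_group)      # SE of the difference, in std units
    z_alpha = norm.ppf(1 - alpha / 2)
    return float(norm.sf(z_alpha - d / se_mult))

# e.g. control revenue $48.00 (std $20), treatment $54.50, 1,650 users per arm
print(achieved_power(48.0, 20.0, 54.5, 1650))
```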
JSON column for experiment data: Uploaded rows stored in a PostgreSQL JSON column, capped at 5,000 rows. Keeps the architecture self-contained without a separate object store. Tradeoff is no SQL-level querying inside the data, acceptable for this use case since all computation runs in-process.
UUID primary keys: Avoids leaking row counts through sequential integer IDs.