A/B Test & Experimentation Platform

A full-stack experimentation platform for designing, running, and analyzing controlled experiments. Covers the full lifecycle: power analysis before data collection, statistical inference after, and sequential monitoring during. Built from scratch with a custom statistical engine rather than relying on black-box libraries.

Architecture

┌─────────────────────────────────────────────────────────┐
│                     React Frontend                       │
│  Dashboard │ Experiment Wizard │ Results & Visualizations│
└─────────────────────┬───────────────────────────────────┘
                      │ REST (Axios)
┌─────────────────────▼───────────────────────────────────┐
│                   FastAPI Backend                        │
│                                                          │
│  /experiments    /stats/run    /stats/power   /upload    │
│                                                          │
│  ┌─────────────────────────────────────────────────┐    │
│  │              Statistical Engine                  │    │
│  │                                                  │    │
│  │  hypothesis_tests.py   power_analysis.py         │    │
│  │  sequential.py                                   │    │
│  └─────────────────────────────────────────────────┘    │
└─────────────────────┬───────────────────────────────────┘
                      │ SQLAlchemy ORM
┌─────────────────────▼───────────────────────────────────┐
│                  PostgreSQL 15                           │
│         experiments table (JSON results column)         │
└─────────────────────────────────────────────────────────┘

Tech Stack

| Layer | Technology |
|---|---|
| Frontend | React 18, Vite, React Router v6, Recharts, Axios |
| Backend | Python 3.11, FastAPI 0.111, Uvicorn |
| Database | PostgreSQL 15, SQLAlchemy 2.0, psycopg2 |
| Statistics | SciPy 1.14, statsmodels 0.14, NumPy 2.1, pandas 2.2 |
| Infrastructure | Docker, Docker Compose |

Statistical Engine

1. Power Analysis

Pre-experiment sample size calculation. Given a baseline rate, minimum detectable effect (MDE), significance level α, and desired power 1-β, the required sample size per variant is:

$$n = \left(\frac{z_{\alpha/2} + z_{\beta}}{\delta}\right)^2$$

Where δ is the standardized effect size. For proportions, the difference standardized by the average of the two Bernoulli variances:

$$\delta = \frac{|p_2 - p_1|}{\sqrt{\frac{p_1(1-p_1) + p_2(1-p_2)}{2}}}$$

For continuous metrics, Cohen's d:

$$\delta = \frac{|\mu_1 - \mu_2|}{\sigma}$$

The platform generates a full power curve across a range of MDE values, surfacing the tradeoff between effect sensitivity and data requirements.
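As a sketch, the sample-size formula above translates to a few lines of SciPy (the function name is illustrative, not taken from the codebase):

```python
import math
from scipy.stats import norm

def sample_size_per_variant(p1, p2, alpha=0.05, power=0.8):
    """n per variant for a two-sided two-proportion test at the given power."""
    # Standardized effect: difference over the average Bernoulli std deviation
    delta = abs(p2 - p1) / math.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / 2)
    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided critical value
    z_beta = norm.ppf(power)           # quantile for desired power
    return math.ceil(((z_alpha + z_beta) / delta) ** 2)

# Detecting a 12% -> 15% lift at alpha = 0.05, power = 0.8
n = sample_size_per_variant(0.12, 0.15)
```

Sweeping `p2` over a grid of MDE values yields the power curve the platform plots.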

2. Proportions Test

Two-proportion z-test for binary outcomes (conversions, clicks). Under H₀: p₁ = p₂, the test statistic uses a pooled estimate of the common proportion:

$$\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$$

$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$

The confidence interval uses the unpooled SE: post-test, we are estimating the actual difference between the two rates, not testing the null hypothesis value.
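A minimal sketch of this test, showing the pooled statistic next to the unpooled interval (the function name is hypothetical):

```python
import math
from scipy.stats import norm

def two_proportion_ztest(x1, n1, x2, n2, alpha=0.05):
    """Pooled z statistic and p-value, plus an unpooled CI on the difference."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    z = (p1 - p2) / math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    p_value = 2 * norm.sf(abs(z))
    # The CI uses the unpooled SE: we are estimating the actual difference
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    zc = norm.ppf(1 - alpha / 2)
    return z, p_value, (p1 - p2 - zc * se, p1 - p2 + zc * se)

z, p, ci = two_proportion_ztest(150, 1000, 120, 1000)
```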

3. Continuous Metrics Test

Welch's t-test for numeric metrics. It explicitly drops the assumption σ₁² = σ₂², which rarely holds in practice. The test statistic:

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$

Degrees of freedom via Welch-Satterthwaite approximation:

$$\nu \approx \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1-1} + \frac{(s_2^2/n_2)^2}{n_2-1}}$$

Using Student's t-test when the variances (and especially the sample sizes) differ can produce anti-conservative p-values, so Welch's test is the safer default.
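The two formulas above can be checked against SciPy, which implements Welch's test via `equal_var=False` (the sample arrays are made up for illustration):

```python
import numpy as np
from scipy import stats

# Two small samples with visibly unequal variances (illustrative data)
a = np.array([10.0, 12.0, 11.0, 13.0, 12.5, 11.5])
b = np.array([14.0, 15.5, 13.0, 16.0, 14.5])

v1, v2 = a.var(ddof=1), b.var(ddof=1)
n1, n2 = len(a), len(b)

# Welch statistic, built directly from the formula above
t_manual = (a.mean() - b.mean()) / np.sqrt(v1 / n1 + v2 / n2)

# Welch–Satterthwaite degrees of freedom
nu = (v1 / n1 + v2 / n2) ** 2 / (
    (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)
)

# SciPy's Welch variant reproduces the statistic
t_scipy, p = stats.ttest_ind(a, b, equal_var=False)
```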

4. Multivariant Testing (A/B/n)

Two-stage procedure to control Type I error inflation from multiple comparisons.

Stage 1 — Omnibus test

For proportions: Pearson chi-squared test on the full contingency table.

For continuous: One-way ANOVA via F-statistic:

$$F = \frac{\text{between-group variance}}{\text{within-group variance}} = \frac{\sum_i n_i (\bar{X}_i - \bar{X})^2 / (k-1)}{\sum_i \sum_j (X_{ij} - \bar{X}_i)^2 / (N-k)}$$
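Both omnibus tests map directly onto SciPy calls (all counts and values below are illustrative):

```python
from scipy import stats

# Proportions: chi-squared on the k-variant contingency table
# rows = [conversions, non-conversions], columns = variants
table = [[120, 150, 130],
         [880, 850, 870]]
chi2, p_chi, dof, expected = stats.chi2_contingency(table)

# Continuous: one-way ANOVA over per-variant metric arrays
control = [48.2, 47.5, 49.1, 46.8, 48.9]
treat_a = [54.0, 55.2, 53.1, 54.8, 55.0]
treat_b = [50.1, 49.8, 51.0, 50.5, 49.9]
f_stat, p_f = stats.f_oneway(control, treat_a, treat_b)
```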

Stage 2 — Pairwise comparisons with correction

Running k independent pairwise tests at level α each yields a family-wise error rate (FWER) of:

$$\text{FWER} = 1 - (1-\alpha)^k$$

At α=0.05 with 3 comparisons, FWER ≈ 14.3%. Holm-Bonferroni correction controls FWER at α without being as conservative as full Bonferroni. Procedure: sort p-values p₁ ≤ p₂ ≤ ... ≤ pₖ and reject H₀ᵢ if:

$$p_i \leq \frac{\alpha}{k - i + 1}$$
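The step-down procedure can be sketched in a few lines (the function name is illustrative):

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Step-down Holm correction; returns reject decisions in input order."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])  # ascending p-values
    reject = [False] * k
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (k - rank):  # alpha / (k - i + 1), 1-based i
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values are kept
    return reject

decisions = holm_bonferroni([0.010, 0.040, 0.030])
```

With α = 0.05 and k = 3, only the smallest p-value clears its threshold of 0.05/3; the next (0.030) fails against 0.05/2, so the step-down stops there.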

5. Sequential Testing — SPRT

Standard fixed-sample tests are statistically invalid if you peek at results mid-experiment and stop early: each additional look inflates the false positive rate. One principled solution is Wald's Sequential Probability Ratio Test (SPRT, 1945).

At each observation, compute the log likelihood ratio between H₁ (effect exists) and H₀ (no effect). For Bernoulli observations:

$$\Lambda_n = \sum_{i=1}^{n} \left[ X_i \log\frac{p_1}{p_0} + (1-X_i)\log\frac{1-p_1}{1-p_0} \right]$$

Define two boundaries from the error rates α (Type I) and β (Type II):

$$A = \log\frac{1-\beta}{\alpha} \qquad B = \log\frac{\beta}{1-\alpha}$$

At each step:

  • If Λₙ ≥ A → reject H₀, declare treatment the winner
  • If Λₙ ≤ B → accept H₀, declare no effect
  • If B < Λₙ < A → continue collecting data

The platform tracks the full LLR trajectory and plots it against both boundaries in real time. Wald's boundary approximations keep both error rates close to their nominal levels, and the Wald–Wolfowitz optimality result shows the SPRT minimizes expected sample size under both hypotheses.
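A compact sketch of the decision loop for Bernoulli observations (function name and data are illustrative):

```python
import math

def sprt_bernoulli(observations, p0, p1, alpha=0.05, beta=0.2):
    """Wald SPRT for 0/1 data; returns a decision and the LLR trajectory."""
    A = math.log((1 - beta) / alpha)  # upper boundary: reject H0
    B = math.log(beta / (1 - alpha))  # lower boundary: accept H0
    llr, trajectory = 0.0, []
    for x in observations:
        llr += x * math.log(p1 / p0) + (1 - x) * math.log((1 - p1) / (1 - p0))
        trajectory.append(llr)
        if llr >= A:
            return "reject_h0", trajectory
        if llr <= B:
            return "accept_h0", trajectory
    return "continue", trajectory  # still between the boundaries

# A run of conversions drives the LLR up to the rejection boundary
decision, traj = sprt_bernoulli([1] * 30, p0=0.12, p1=0.15)
```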

Project Structure

ab-test-platform/
├── backend/
│   ├── main.py                  # FastAPI app, CORS middleware, lifespan
│   ├── db.py                    # SQLAlchemy ORM models, session factory
│   ├── engine/
│   │   ├── __init__.py
│   │   ├── hypothesis_tests.py  # z-test, Welch t-test, ANOVA, Holm-Bonferroni
│   │   ├── power_analysis.py    # sample size calculator, MDE power curves
│   │   └── sequential.py        # SPRT with full LLR trajectory
│   ├── routers/
│   │   ├── experiments.py       # CRUD, CSV ingestion, JSON column storage
│   │   ├── stats.py             # test dispatch, metric type inference
│   │   └── upload.py            # CSV preview, column type inference, validation
│   └── requirements.txt
├── frontend/
│   └── src/
│       ├── pages/
│       │   ├── Dashboard.jsx     # experiment list, run/delete actions
│       │   ├── NewExperiment.jsx # 3-step wizard, auto column detection
│       │   └── Results.jsx       # stat cards, uplift chart, CI, SPRT plot, power curve
│       ├── api.js                # axios base client, all endpoint calls
│       └── App.jsx               # router, sidebar layout
├── data/
│   └── sample_experiment.csv    # 5000 rows, 3 variants, baked lift signals
├── generate_sample_data.py
└── docker-compose.yml

Running Locally

Prerequisites: Python 3.11, Node.js 18+, PostgreSQL 15

Backend

cd backend
pip install -r requirements.txt
uvicorn main:app --reload --port 8000

Frontend

cd frontend
npm install
npm run dev

With Docker

docker-compose up

App at http://localhost:5173. API docs at http://localhost:8000/docs.

Sample Data

Synthetic experiment dataset with 5,000 observations across 3 variants:

python generate_sample_data.py
| Variant | Conversion | Avg Revenue | Avg Session |
|---|---|---|---|
| control | 12.0% | $48.00 | 3.8 min |
| treatment_a | 15.0% | $54.50 | 4.1 min |
| treatment_b | 13.0% | $50.20 | 3.9 min |

API Reference

| Method | Endpoint | Description |
|---|---|---|
| GET | /experiments | List experiments ordered by created_at desc |
| POST | /experiments | Create experiment with CSV upload |
| GET | /experiments/{id} | Fetch experiment with full results blob |
| DELETE | /experiments/{id} | Hard delete |
| POST | /stats/run/{id} | Dispatch test, auto-detect metric type, persist results |
| GET | /stats/results/{id} | Fetch stored results |
| POST | /stats/power | Standalone power analysis; returns n and power curve |
| POST | /upload/preview | Parse CSV, infer column types, 5-row preview |
| POST | /upload/validate | Validate config, return structured errors and warnings |

Design Notes

Metric type auto-detection: The stats router checks if all unique metric values are in {0, 1, 0.0, 1.0}. If true, binary test is dispatched. Otherwise continuous. Reduces misconfiguration without requiring explicit user input.
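A sketch of that detection rule, assuming a flat list of metric values (the function name is hypothetical):

```python
def infer_metric_type(values):
    """Binary if every distinct value is 0/1 (int or float), else continuous."""
    # In Python, 0 == 0.0 and 1 == 1.0 hash identically,
    # so one set comparison covers both representations
    return "binary" if set(values) <= {0, 1} else "continuous"

kind_a = infer_metric_type([0, 1, 1, 0.0, 1.0])  # conversion flags
kind_b = infer_metric_type([48.0, 54.5, 50.2])   # revenue per user
```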

Post-hoc power analysis: Runs after the test using the control group's observed mean and std from the actual data, not just pre-specified parameters. Tells you whether the experiment was adequately powered for the effect size that was actually present.

JSON column for experiment data: Uploaded rows stored in a PostgreSQL JSON column, capped at 5,000 rows. Keeps the architecture self-contained without a separate object store. Tradeoff is no SQL-level querying inside the data, acceptable for this use case since all computation runs in-process.

UUID primary keys: Avoids leaking row counts through sequential integer IDs.
