AtlasED

Specification-aware cross-jurisdictional education policy discourse analysis.

AtlasED is a production NLP pipeline that analyses education policy discourse across England, Scotland, and Ireland. It does two things:

Surfaces how different institutional actors frame educational disadvantage — tracking which organisations dominate the debate, how topic prominence varies across jurisdictions, and whether those patterns diverge systematically.
Makes the specification choices behind the analysis visible and auditable — surfacing how data selection, model selection, parameterisation, and preprocessing decisions shape what the analysis finds, with the goal of informing guidelines on how AI analysis should be used responsibly in public policy.

The pipeline is built around four models for analysing and comparing policy discourse (M1 to M3 are trained; M4 is planned and not yet built):

M1: NMF trained on England's corpus (k=30) — England's debate in its own vocabulary
M2: NMF trained on Scotland's corpus (k=15) — Scotland's debate in its own vocabulary
M3: NMF trained on Ireland's corpus (k=10) — Ireland's debate in its own vocabulary
M4: BERTopic trained on all three countries combined — shared topics across jurisdictions using semantic embeddings

Topic alignment via cosine similarity enables cross-model comparison despite different topic counts. Each country's debate is represented on its own terms; the combined model shows where debates converge and diverge. An earlier England-as-baseline design (training on England, applying to Scotland/Ireland as inference) revealed that England's vocabulary acted as the implicit norm — this finding is retained as specification sensitivity evidence but is not the primary analytical framework.

Research Questions

How do different institutional actors across England, Scotland, and Ireland operationalise the construct of educational disadvantage — and do those operationalisations diverge systematically?
When a model trained on England's policy corpus is applied to Scottish and Irish corpora, is the resulting distributional drift directional and persistent?
Do specification choices — construct definitions, preprocessing decisions, model parameters — determine findings independently of the underlying policy landscape?

Data

~5,000 documents (post-cleaning) from government departments, think tanks, media outlets, professional bodies, and research organisations across three jurisdictions. Updated weekly.

Jurisdiction	Raw	Training (post-cleaning)	Weekly inference (2026)	Model
England	3,943	3,931	231 (weeks 1–11)	M1: NMF k=30
Scotland	511	481	120 (weeks 1–11)	M2: NMF k=15
Ireland	1,040	575	67 (weeks 1–11)	M3: NMF k=10
Combined	~5,494	~4,987	—	M4: BERTopic

Cleaning removes boilerplate (GOV.UK footers, FFT newsletter signups, gov.scot media enquiries footers) and drops articles with <200 characters of content after cleaning (e.g., gov.ie PDF landing pages with no inline text, ADES title-only articles). See the specification choices log for details on what is removed and why.
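
The cleaning rule above can be sketched roughly as follows. This is a minimal illustration, not the project's `s02_cleaning.py`: the two regex patterns and the function names `clean_article` and `filter_articles` are hypothetical, and only the 200-character threshold comes from the description above.

```python
import re

# Hypothetical boilerplate patterns for illustration only; the real rules
# are documented in the specification choices log.
BOILERPLATE_PATTERNS = [
    re.compile(r"Sign up to our newsletter.*", re.IGNORECASE | re.DOTALL),
    re.compile(r"Media enquiries.*", re.IGNORECASE | re.DOTALL),
]

MIN_CHARS = 200  # articles shorter than this after cleaning are dropped


def clean_article(text):
    """Strip boilerplate matches, then collapse runs of whitespace."""
    for pattern in BOILERPLATE_PATTERNS:
        text = pattern.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


def filter_articles(texts):
    """Return cleaned texts that still have at least MIN_CHARS characters."""
    cleaned = (clean_article(t) for t in texts)
    return [t for t in cleaned if len(t) >= MIN_CHARS]
```

Measuring length after boilerplate removal matters here: a landing page that consists mostly of footer text passes a raw-length check but fails the post-cleaning one.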

England sources: SchoolsWeek, UK Government, FFT Education Datalab, Education Policy Institute, Nuffield Foundation, Federation of Education.
Scotland sources: Scottish Government, Children in Scotland, GTCS, ADES, SERA.
Ireland sources: Irish Government, ESRI, Teaching Council, Education Research Centre, Education Matters.

Country is derived from the source organisation, not hardcoded.
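
One plausible way to derive country from source is a lookup built from the source lists. The sketch below assumes the organisation names appear in the data exactly as listed in this README and uses the `eng`/`sco`/`irl` codes from config.yaml; the names `SOURCES_BY_COUNTRY` and `country_for` are hypothetical.

```python
# Source lists as given in this README; codes mirror config.yaml.
SOURCES_BY_COUNTRY = {
    "eng": [
        "SchoolsWeek", "UK Government", "FFT Education Datalab",
        "Education Policy Institute", "Nuffield Foundation",
        "Federation of Education",
    ],
    "sco": [
        "Scottish Government", "Children in Scotland", "GTCS", "ADES", "SERA",
    ],
    "irl": [
        "Irish Government", "ESRI", "Teaching Council",
        "Education Research Centre", "Education Matters",
    ],
}

# Invert the mapping once: organisation name -> country code.
COUNTRY_BY_SOURCE = {
    org: country
    for country, orgs in SOURCES_BY_COUNTRY.items()
    for org in orgs
}


def country_for(source):
    """Look up the country code for a source organisation."""
    try:
        return COUNTRY_BY_SOURCE[source]
    except KeyError:
        # Fail loudly so a new source cannot silently default to a country.
        raise ValueError(f"Unknown source organisation: {source!r}") from None
```

Raising on an unknown organisation, rather than falling back to a default, keeps source selection an explicit specification choice.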

Storage: Supabase/PostgreSQL. Raw text in articles_raw, topic assignments in articles_topics.

Architecture

Per-country NMF models (M1, M2, M3)

Each country gets its own NMF model trained on its own corpus with k determined by coherence sweep. This means each jurisdiction's debate is represented in its own vocabulary — Scotland discovers Scottish topics, Ireland discovers Irish topics, neither is measured as deviation from England.

M1 (England): 3,931 articles (post-cleaning), k=30 (confirmed by coherence sweep and stability testing)
M2 (Scotland): 481 articles (post-cleaning), k=15 (coherence sweep completed — coherence plateaus at k=15, 0.640)
M3 (Ireland): 575 articles (post-cleaning), k=10 (coherence drops above k=10 due to gov.ie corpus homogeneity)
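
How a coherence sweep turns into a single k is itself a specification choice. The sketch below shows one possible rule, assumed here for illustration and not confirmed as the project's method: pick the smallest k whose coherence is within a tolerance of the best score. The function name `pick_k`, the tolerance, and the scores in the usage example are all hypothetical.

```python
def pick_k(coherence_by_k, tolerance=0.01):
    """Return the smallest k whose coherence is within `tolerance` of the best.

    coherence_by_k maps each candidate k to its coherence score.
    """
    if not coherence_by_k:
        raise ValueError("coherence sweep is empty")
    best = max(coherence_by_k.values())
    return min(k for k, score in coherence_by_k.items() if score >= best - tolerance)


# Illustrative numbers only, not AtlasED results.
plateau_sweep = {5: 0.50, 10: 0.60, 15: 0.64, 20: 0.645, 25: 0.63}
falling_sweep = {5: 0.55, 10: 0.62, 15: 0.58, 20: 0.54}
print(pick_k(plateau_sweep), pick_k(falling_sweep))
```

Under this rule a plateau resolves to its first point and a curve that falls after a peak resolves to the peak, which matches the two behaviours described for Scotland and Ireland. A different tolerance can select a different k, so the value would need logging alongside the result.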

Combined BERTopic model (M4)

BERTopic trained on all three countries' corpora combined. Uses semantic embeddings, which can bridge vocabulary gaps that NMF's bag-of-words approach treats as separate (e.g. clustering "ASN" with "SEND"). Auto-discovers topic count. Status: not yet built.

Topic alignment

Cosine similarity between topic-word vectors enables cross-model comparison despite different k values. The dashboard shows where topics align (shared concerns) and where they're country-specific.
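
A minimal sketch of this alignment, assuming each topic is available as a mapping from term to weight (for example, a row of an NMF components matrix paired with its vocabulary). The function names, the 0.5 threshold, and the toy topics are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse term -> weight mappings."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def align_topics(model_a, model_b, threshold=0.5):
    """For each topic in model_a, find its most similar topic in model_b.

    Both arguments map topic name -> {term: weight}; model_b must be non-empty.
    Topics whose best similarity falls below `threshold` are reported as
    unmatched (None), i.e. treated as country-specific.
    """
    matches = {}
    for name_a, vec_a in model_a.items():
        best_name, best_sim = max(
            ((name_b, cosine(vec_a, vec_b)) for name_b, vec_b in model_b.items()),
            key=lambda pair: pair[1],
        )
        matches[name_a] = (best_name if best_sim >= threshold else None, best_sim)
    return matches


# Toy topics for illustration only.
england = {"E0": {"send": 0.9, "funding": 0.4}, "E1": {"ofsted": 1.0}}
scotland = {"S0": {"funding": 0.5, "send": 0.8}, "S1": {"gaelic": 1.0}}
print(align_topics(england, scotland))
```

Because the mappings are sparse, the two models do not need the same vocabulary or the same k: a term missing from one model simply contributes zero to the dot product. The same property means topics expressed in different vocabulary score as dissimilar even when they concern the same issue.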

Drift detection

Jensen-Shannon divergence computed per country against its own training distribution. Drift monitoring runs monthly (weekly article volume is too low for reliable weekly scores). Per-jurisdiction drift trajectories stored in Supabase. Automated via GitHub Actions.
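
For reference, Jensen-Shannon divergence between a training topic distribution and a monthly inference distribution can be computed as in the sketch below. It is a plain-Python illustration, not the code in `drift_monitor.py`; the function names and the use of raw topic counts as input are assumptions.

```python
import math


def normalise(counts):
    """Convert non-negative counts into a probability distribution."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    return [c / total for c in counts]


def _kl_bits(p, q):
    """KL(P || Q) in bits, skipping terms where p is zero."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(baseline_counts, current_counts):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint)."""
    if len(baseline_counts) != len(current_counts):
        raise ValueError("distributions must cover the same topics")
    p = normalise(baseline_counts)
    q = normalise(current_counts)
    # The mixture m is positive wherever p or q is, so no smoothing is needed.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * _kl_bits(p, m) + 0.5 * _kl_bits(q, m)
```

With base-2 logarithms the score is bounded between 0 and 1, which makes trajectories comparable across jurisdictions with different k. Note that SciPy's `scipy.spatial.distance.jensenshannon` returns the square root of this quantity (the JS distance), so the two are not interchangeable without squaring.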

England-as-baseline (specification sensitivity evidence)

An earlier design trained on England only and applied to Scotland/Ireland as inference. This revealed that England's vocabulary acted as the implicit norm — Scottish and Irish documents were measured as deviation. These results are retained as specification sensitivity evidence: the same data analysed through a different training design produces different conclusions.

Cross-jurisdiction distributional analysis (in progress)

KL divergence (both directions, all jurisdiction pairs), balanced subsamples, and parameter perturbation are planned as robustness checks.
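
A sketch of the two-directional KL computation, under two assumptions that are not stated above: the jurisdictions' distributions are expressed over a shared topic space, and zero counts are handled with a small additive smoothing constant. Function names and the toy counts are hypothetical.

```python
import math


def smooth(counts, eps=1e-9):
    """Additively smooth counts into a strictly positive distribution."""
    total = sum(counts) + eps * len(counts)
    return [(c + eps) / total for c in counts]


def kl_divergence(p_counts, q_counts, eps=1e-9):
    """KL(P || Q) in bits. Asymmetric: KL(P || Q) != KL(Q || P) in general."""
    if len(p_counts) != len(q_counts):
        raise ValueError("distributions must cover the same topics")
    p = smooth(p_counts, eps)
    q = smooth(q_counts, eps)
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))


def pairwise_kl(distributions):
    """KL divergence for every ordered pair of jurisdictions."""
    return {
        (a, b): kl_divergence(distributions[a], distributions[b])
        for a in distributions
        for b in distributions
        if a != b
    }


# Toy topic counts for illustration only, not AtlasED results.
toy = {"eng": [90, 9, 1], "sco": [40, 30, 30], "irl": [34, 33, 33]}
for pair, value in sorted(pairwise_kl(toy).items()):
    print(pair, round(value, 3))
```

Reporting both directions is the point of the check: a large gap between KL(a || b) and KL(b || a) indicates that one jurisdiction's distribution explains the other much better than the reverse. The smoothing constant is itself a parameter that affects the result when a topic is absent from one jurisdiction.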

Specification scoring layer

Three computable dimensions that make specification choices visible:

Proxy concentration — how far a small number of construct proxies account for observed topic variance.
Specification sensitivity — how far findings shift under perturbation of modelling decisions.
Normative divergence — whether one jurisdiction's corpus systematically acts as the implicit baseline.

A fourth dimension (recourse quality) was designed, tested, and scoped out of the current implementation. Inter-rater agreement was unreliable at this stage.

Project Structure

AM1_topic_modelling/
├── config.yaml                              # All tunable parameters
├── sync_from_supabase.py                    # Pull data from Supabase to local CSVs
├── run_weekly.py                            # Weekly pipeline: sync → inference → write
├── run_monthly_drift.py                     # Monthly drift monitoring
├── Dockerfile                               # API deployment container
├── data/
│   ├── training/                            # Training CSVs — all three countries (synced, gitignored)
│   ├── inference/                           # Weekly inference CSVs (synced, gitignored)
│   └── evaluation_outputs/                  # Coherence, stability, topic comparison CSVs
├── analysis/
│   ├── training/                            # Model training (all countries)
│   │   ├── s01_data_loader.py               #   Load from CSV
│   │   ├── s02_cleaning.py                  #   Structural text cleaning
│   │   ├── s03_spacy_processing.py          #   Lemmatisation, POS filtering, stopwords
│   │   ├── s04_vectorisation.py             #   TF-IDF
│   │   ├── s05_nmf_training.py              #   NMF model fitting
│   │   ├── s06_topic_allocation.py          #   Assign topics to documents
│   │   ├── s07_evaluation.py                #   Coherence + stability
│   │   ├── s08_save_outputs.py              #   Versioned run artifacts
│   │   ├── s09_mlflow_logging.py            #   Experiment tracking
│   │   ├── s10_pipeline.py                  #   Training orchestrator
│   │   └── s11_supabase_writer.py           #   Write results to DB
│   ├── inference/                           # Weekly + backfill inference
│   │   ├── batch_runner.py                  #   NMF inference (all countries)
│   │   └── drift_monitor.py                 #   Per-jurisdiction JS divergence
│   ├── api/                                 # FastAPI serving layer
│   │   ├── main.py
│   │   └── model_loader.py
│   └── dashboard/                           # Streamlit application
│       ├── app.py                           #   Overview
│       ├── supabase_loader.py               #   Data loading + caching
│       └── pages/
│           ├── 1_Topic_Explorer.py
│           ├── 2_Trends.py
│           ├── 3_Organisations.py
│           └── 4_Framing_Analysis.py
├── experiments/
│   ├── outputs/runs/                        # Versioned model artifacts (gitignored)
│   ├── notebooks/                           # Training + EDA notebooks
│   └── mlruns/                              # MLflow experiment store (gitignored)
├── tests/
│   └── test_pipeline.py                     # Smoke tests
└── .github/workflows/
    ├── weekly_inference.yml                 # Saturday 8am: sync → inference
    └── monthly_drift.yml                    # 1st of month: drift monitoring

Tech Stack

Component	Technology
Topic modelling	scikit-learn NMF
NLP preprocessing	spaCy (`en_core_web_sm`)
Vectorisation	TF-IDF
API	FastAPI + Pydantic
Dashboard	Streamlit
Database	Supabase (PostgreSQL)
Experiment tracking	MLflow
CI/CD	GitHub Actions (weekly inference, monthly drift)
Deployment	Docker, Render

Specification Choices as First-Class Outputs

Every modelling decision in this pipeline is a specification choice. The following are logged, surfaced on the dashboard, and tested under perturbation:

Choice	Current setting	Why it matters
Training corpus	Per-country (England, Scotland, Ireland each trained separately)	Each country's topic space reflects its own vocabulary.
Preprocessing	spaCy `en_core_web_sm`	English-language model applied to Scottish/Irish policy text.
Model	NMF (baseline). BERTopic comparison planned.	Different models surface different topic structures.
Number of topics (k)	30 (England), 15 (Scotland), 10 (Ireland)	Varied 5–50 in coherence sweep. k=25 and k=35 qualitatively reviewed.
TF-IDF parameters	min_df=3, max_df=0.85, max_features=3000	Controls what vocabulary enters the model.
Topic naming	LLM-generated from top keywords	Reproducible but not neutral.
Source selection	6 eng, 5 sco, 5 irl organisations	Who is in the corpus determines what the model finds.
Date range	Jan 2023 – present	A political choice, not a neutral setting.
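
To make the TF-IDF row concrete, the sketch below shows how `min_df`, `max_df`, and `max_features` prune a vocabulary before any weighting happens. It follows scikit-learn's documented semantics (an integer `min_df` is a document count, a float `max_df` is a proportion of documents, `max_features` keeps the most frequent remaining terms), but it is a plain-Python illustration rather than the pipeline's vectoriser. The function name, the alphabetical tie-break, and the toy documents are assumptions.

```python
from collections import Counter


def prune_vocabulary(docs, min_df=3, max_df=0.85, max_features=3000):
    """Return the sorted terms that survive document-frequency pruning.

    docs is a list of token lists (already lemmatised and filtered).
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    term_freq = Counter(term for doc in docs for term in doc)

    max_doc_count = max_df * n_docs
    kept = [
        term for term, df in doc_freq.items()
        if min_df <= df <= max_doc_count
    ]
    # Keep the most frequent terms; ties broken alphabetically (an assumption).
    kept.sort(key=lambda term: (-term_freq[term], term))
    return sorted(kept[:max_features])


# Toy corpus for illustration only.
toy_docs = [
    ["pupil", "premium", "school"],
    ["pupil", "school", "attainment"],
    ["pupil", "school", "attainment"],
    ["pupil", "premium", "school"],
    ["pupil", "attainment", "deis"],
]
print(prune_vocabulary(toy_docs))
```

In the toy corpus, "pupil" is dropped for appearing in every document, while "premium" and "deis" are dropped for appearing in fewer than three. The second case is the one that matters for smaller corpora: a jurisdiction-specific term needs to clear `min_df` within its own corpus to enter the model at all.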

Getting Started

Install

python -m venv venv
source venv/bin/activate
pip install -r requirements-analysis.txt
python -m spacy download en_core_web_sm

Train the model

python -m analysis.training.s10_pipeline

Run inference

python -m analysis.inference.batch_runner

Launch the dashboard

streamlit run analysis/dashboard/app.py

Start the API

uvicorn analysis.api.main:app --reload

Configuration

All tunable parameters live in config.yaml. The excerpt below shows the England (M1) training settings:

data:
  dataset_name: eng_training
  source: supabase
  training_country: eng

nmf:
  n_topics: 30
  random_state: 42
  init: nndsvd
  max_iter: 1000

tfidf:
  min_df: 3
  max_df: 0.85
  max_features: 3000
  ngram_range: [1, 2]

inference:
  model_type: nmf
  countries: [eng, sco, irl]

drift:
  cadence: monthly
  baseline: eng_training

Theoretical Foundations

The specification scoring layer is grounded in three independent theoretical traditions:

Proxy concentration derives from the construct validity literature. Jacobs and Wallach (2021) argue that fairness is a property of measurement models, not algorithms — the question of what a system is actually measuring is upstream of the question of whether it measures it fairly.

Specification sensitivity operationalises the replication crisis argument. Botvinik-Nezer et al. (Nature, 2020) showed that seventy independent teams given identical fMRI data reached different conclusions because of differences in their analysis pipeline choices.

Normative divergence derives from the situated knowledge tradition. D'Ignazio and Klein's Data Feminism (2020) operationalises the argument that data systems encode the perspectives of their designers. Normative divergence makes that encoding computable.

Scope and Limitations

In scope: Text-based policy corpora. Public artefacts. Design-time specification choices.

Out of scope: Image-based systems. Real-time decision pipelines without document corpora. Runtime specification logging for agentic systems. Recourse quality as a computable metric (scoped out of v1).

Known limitations:

England dominates the corpus by volume. This is surfaced as a finding, not corrected.
spaCy en_core_web_sm is an English-language model applied to all three jurisdictions. The vocabulary divergence is audited and reported.
Topic labels are a specification choice. They are generated by LLM from top keywords, not ground truth.
The pipeline finds patterns in language. It cannot determine whether those patterns reflect policy reality or corpus construction. Both possibilities are surfaced.

Planned Extensions

BERTopic comparison — model swap robustness check using sentence-transformer embeddings. Tests whether cross-jurisdictional findings hold under a different modelling approach.
KL divergence analysis — asymmetric divergence computation across all jurisdiction pairs to quantify normative dominance.
Specification scoring layer — three computable dimensions (proxy concentration, specification sensitivity, normative divergence) extracted from pipeline outputs.
RAG chatbot — LangGraph agent allowing stakeholders to query the dataset in natural language, with Langfuse observability and Inspect evaluation.
Build Your Model — interactive dashboard page where stakeholders change parameters and observe how findings shift.
Cross-domain application — testing the specification scoring approach in health, criminal justice, and immigration policy.

Licence

MIT

UCL Institute of Education | Level 6 AI Engineering Apprenticeship | 2025–2026
