Specification-aware cross-jurisdictional education policy discourse analysis.
AtlasED is a production NLP pipeline that analyses education policy discourse across England, Scotland, and Ireland. It does two things:
- Surfaces how different institutional actors frame educational disadvantage, tracking which organisations dominate the debate, how topic prominence varies across jurisdictions, and whether those patterns diverge systematically.
- Makes the specification choices behind the analysis visible and auditable, surfacing how data selection, model selection, parameterisation, and preprocessing decisions shape what the analysis finds, with the goal of informing guidelines on how AI analysis should be used responsibly in public policy.
The pipeline uses four models to analyse and compare policy discourse:
- M1: NMF trained on England's corpus (k=30); England's debate in its own vocabulary
- M2: NMF trained on Scotland's corpus (k=15); Scotland's debate in its own vocabulary
- M3: NMF trained on Ireland's corpus (k=10); Ireland's debate in its own vocabulary
- M4: BERTopic trained on all three countries combined (planned, not yet built); shared topics across jurisdictions using semantic embeddings
Topic alignment via cosine similarity enables cross-model comparison despite different topic counts. Each country's debate is represented on its own terms; the combined model shows where debates converge and diverge. An earlier England-as-baseline design (training on England, applying to Scotland/Ireland as inference) revealed that England's vocabulary acted as the implicit norm — this finding is retained as specification sensitivity evidence but is not the primary analytical framework.
- How do different institutional actors across England, Scotland, and Ireland operationalise the construct of educational disadvantage — and do those operationalisations diverge systematically?
- When a model trained on England's policy corpus is applied to Scottish and Irish corpora, is the resulting distributional drift directional and persistent?
- Do specification choices — construct definitions, preprocessing decisions, model parameters — determine findings independently of the underlying policy landscape?
~5,000 documents (post-cleaning) from government departments, think tanks, media outlets, professional bodies, and research organisations across three jurisdictions. Updated weekly.
| Jurisdiction | Raw | Training (post-cleaning) | Weekly inference (2026) | Model |
|---|---|---|---|---|
| England | 3,943 | 3,931 | 231 (weeks 1–11) | M1: NMF k=30 |
| Scotland | 511 | 481 | 120 (weeks 1–11) | M2: NMF k=15 |
| Ireland | 1,040 | 575 | 67 (weeks 1–11) | M3: NMF k=10 |
| Combined | ~5,494 | ~4,987 | — | M4: BERTopic |
Cleaning removes boilerplate (GOV.UK footers, FFT newsletter signups, gov.scot media enquiries footers) and drops articles with <200 characters of content after cleaning (e.g., gov.ie PDF landing pages with no inline text, ADES title-only articles). See the specification choices log for details on what is removed and why.
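
The cleaning rule above can be sketched as follows. This is a minimal illustration, not the contents of `s02_cleaning.py`: the regular expressions and the names `clean_article` and `keep_article` are assumptions made for the example, and only the 200-character threshold comes from the text.

```python
import re

# Hypothetical boilerplate patterns for illustration only; the real patterns
# are documented in the specification choices log.
BOILERPLATE_PATTERNS = [
    re.compile(r"Is this page useful\?.*", re.IGNORECASE | re.DOTALL),      # assumed GOV.UK footer
    re.compile(r"Sign up to our newsletter.*", re.IGNORECASE | re.DOTALL),  # assumed newsletter signup
    re.compile(r"Media enquiries.*", re.IGNORECASE | re.DOTALL),            # assumed gov.scot footer
]

MIN_CHARS = 200  # threshold stated in the text


def clean_article(text: str) -> str:
    """Strip boilerplate blocks and collapse runs of whitespace."""
    for pattern in BOILERPLATE_PATTERNS:
        text = pattern.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


def keep_article(text: str) -> bool:
    """Return True if the cleaned article has at least MIN_CHARS characters."""
    return len(clean_article(text)) >= MIN_CHARS
```

Applying the length check after boilerplate removal matters: a title-only page with a long footer would otherwise pass the threshold.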
England sources: SchoolsWeek, UK Government, FFT Education Datalab, Education Policy Institute, Nuffield Foundation, Federation of Education. Scotland sources: Scottish Government, Children in Scotland, GTCS, ADES, SERA. Ireland sources: Irish Government, ESRI, Teaching Council, Education Research Centre, Education Matters.
Country is derived from the source organisation, not hardcoded.
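
One way to derive country from source is a lookup table keyed on organisation name. The sketch below assumes the source names are stored exactly as listed above and uses the country codes from `config.yaml`; `SOURCE_COUNTRY` and `country_for` are hypothetical names.

```python
# Lookup built from the source lists above. Country codes follow config.yaml
# (eng, sco, irl). The dict and function names are hypothetical.
SOURCE_COUNTRY = {
    "SchoolsWeek": "eng",
    "UK Government": "eng",
    "FFT Education Datalab": "eng",
    "Education Policy Institute": "eng",
    "Nuffield Foundation": "eng",
    "Federation of Education": "eng",
    "Scottish Government": "sco",
    "Children in Scotland": "sco",
    "GTCS": "sco",
    "ADES": "sco",
    "SERA": "sco",
    "Irish Government": "irl",
    "ESRI": "irl",
    "Teaching Council": "irl",
    "Education Research Centre": "irl",
    "Education Matters": "irl",
}


def country_for(source: str) -> str:
    """Return the country code for a source, failing loudly on unknown sources."""
    try:
        return SOURCE_COUNTRY[source]
    except KeyError:
        raise ValueError(f"Unknown source organisation: {source!r}") from None
```

Raising on an unknown source, rather than defaulting to a country, keeps a newly added organisation from being silently assigned to the wrong jurisdiction.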
Storage: Supabase/PostgreSQL. Raw text in articles_raw, topic assignments in articles_topics.
Each country gets its own NMF model trained on its own corpus with k determined by coherence sweep. This means each jurisdiction's debate is represented in its own vocabulary — Scotland discovers Scottish topics, Ireland discovers Irish topics, neither is measured as deviation from England.
- M1 (England): 3,931 articles (post-cleaning), k=30 (confirmed by coherence sweep and stability testing)
- M2 (Scotland): 481 articles (post-cleaning), k=15 (coherence sweep completed — coherence plateaus at k=15, 0.640)
- M3 (Ireland): 575 articles (post-cleaning), k=10 (coherence drops above k=10 due to gov.ie corpus homogeneity)
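
A coherence sweep compares candidate values of k by scoring the topics each one produces. The text does not name the coherence measure used, so the sketch below substitutes UMass coherence as a stand-in; its scores are not comparable with the 0.640 figure reported for Scotland. The function names are hypothetical, and picking the maximum is a simplification of the plateau-based judgement described above.

```python
import math
from itertools import combinations


def umass_coherence(top_words, docs):
    """UMass coherence for one topic.

    top_words: topic terms ordered from most to least important.
    docs: list of token sets, one set per document.
    """
    score = 0.0
    for i, j in combinations(range(len(top_words)), 2):
        # i < j, so top_words[i] is the higher-ranked word of the pair.
        w_hi, w_lo = top_words[i], top_words[j]
        d_hi = sum(1 for d in docs if w_hi in d)
        d_both = sum(1 for d in docs if w_hi in d and w_lo in d)
        if d_hi:  # skip words that never occur in the reference corpus
            score += math.log((d_both + 1) / d_hi)
    return score


def best_k(topics_by_k, docs):
    """Return (k with the highest mean coherence, mean coherence per k).

    topics_by_k: {k: [top_words_for_topic_0, top_words_for_topic_1, ...]}
    """
    means = {
        k: sum(umass_coherence(t, docs) for t in topics) / len(topics)
        for k, topics in topics_by_k.items()
    }
    return max(means, key=means.get), means
```

In practice `topics_by_k` would be filled by fitting one NMF model per candidate k and reading off each topic's top terms.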
BERTopic trained on all three countries' corpora combined. Uses semantic embeddings which can bridge vocabulary gaps (e.g. clusters "ASN" with "SEND") that NMF's bag-of-words approach treats as separate. Auto-discovers topic count. [TBD — not yet built]
Cosine similarity between topic-word vectors enables cross-model comparison despite different k values. The dashboard shows where topics align (shared concerns) and where they're country-specific.
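
Because each country model has its own vocabulary, topic-word vectors need to be placed on a common set of terms before they can be compared. The sketch below assumes the union of the two vocabularies is used, with zeros for terms a model never saw; `align_topics` is a hypothetical name and this is not necessarily how the pipeline does it.

```python
import numpy as np


def align_topics(H_a, vocab_a, H_b, vocab_b):
    """Cosine similarity between every topic in model A and every topic in model B.

    H_a, H_b: topic-word matrices (n_topics x n_terms), e.g. NMF components.
    vocab_a, vocab_b: term lists matching the columns of H_a and H_b.
    Returns an (n_topics_a x n_topics_b) similarity matrix.
    """
    union_vocab = sorted(set(vocab_a) | set(vocab_b))
    index = {term: i for i, term in enumerate(union_vocab)}

    def project(H, vocab):
        # Place each model's columns at their positions in the union vocabulary.
        H = np.asarray(H, dtype=float)
        out = np.zeros((H.shape[0], len(union_vocab)))
        out[:, [index[t] for t in vocab]] = H
        return out

    def unit_rows(M):
        # Normalise rows to unit length, leaving all-zero rows as zeros.
        norms = np.linalg.norm(M, axis=1, keepdims=True)
        return np.divide(M, norms, out=np.zeros_like(M), where=norms > 0)

    A = unit_rows(project(H_a, vocab_a))
    B = unit_rows(project(H_b, vocab_b))
    return A @ B.T
```

The result has one row per topic in the first model and one column per topic in the second, so models with k=30 and k=15 yield a 30 x 15 matrix. Topics that share no terms score 0, which is the bag-of-words limitation that semantic embeddings are meant to address.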
Jensen-Shannon divergence computed per country against its own training distribution. Drift monitoring runs monthly (weekly article volume is too low for reliable weekly scores). Per-jurisdiction drift trajectories stored in Supabase. Automated via GitHub Actions.
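
As a minimal sketch of the drift score, the function below computes Jensen-Shannon divergence between two topic distributions, for example a country's training distribution and one month of inference. It assumes both are given as non-negative weights over the same topics; `js_divergence` is a hypothetical name, not the function in `drift_monitor.py`.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions.

    p, q: non-negative weights over the same topics; they are normalised here.
    Returns a value between 0 (identical) and 1 (no overlap).
    """
    if len(p) != len(q):
        raise ValueError("p and q must cover the same topics")
    p_sum, q_sum = sum(p), sum(q)
    p = [x / p_sum for x in p]
    q = [x / q_sum for x in q]
    m = [(a + b) / 2 for a, b in zip(p, q)]  # mixture distribution

    def kl(a, b):
        # Terms with a zero numerator contribute nothing to the sum.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Note that `scipy.spatial.distance.jensenshannon` returns the Jensen-Shannon distance, which is the square root of this quantity, so scores from the two are not interchangeable.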
An earlier design trained on England only and applied to Scotland/Ireland as inference. This revealed that England's vocabulary acted as the implicit norm — Scottish and Irish documents were measured as deviation. These results are retained as specification sensitivity evidence: the same data analysed through a different training design produces different conclusions.
KL divergence (both directions, all jurisdiction pairs), balanced subsamples, and parameter perturbation are planned as robustness checks.
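
The planned KL check could look roughly like the sketch below. It assumes each jurisdiction's topic distribution is expressed over a shared set of topics, since KL divergence is only defined over a common support, and it uses a small additive smoothing constant to avoid division by zero. The names and the `eps` value are assumptions for illustration.

```python
import math


def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) in bits, with additive smoothing so zeros in q stay finite."""
    p = [x + eps for x in p]
    q = [x + eps for x in q]
    p_sum, q_sum = sum(p), sum(q)
    return sum(
        (a / p_sum) * math.log2((a / p_sum) / (b / q_sum))
        for a, b in zip(p, q)
    )


def pairwise_kl(dists):
    """KL divergence for every ordered pair of jurisdictions.

    dists: {jurisdiction: topic distribution over a shared topic set}
    Returns {(a, b): KL(a || b)} for all a != b.
    """
    return {
        (a, b): kl_divergence(dists[a], dists[b])
        for a in dists
        for b in dists
        if a != b
    }
```

Unlike Jensen-Shannon divergence, KL is asymmetric, so `(eng, sco)` and `(sco, eng)` generally differ. That asymmetry is what makes it a candidate measure for directional dominance.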
Three computable dimensions that make specification choices visible:
- Proxy concentration — how far a small number of construct proxies account for observed topic variance.
- Specification sensitivity — how far findings shift under perturbation of modelling decisions.
- Normative divergence — whether one jurisdiction's corpus systematically acts as the implicit baseline.
A fourth dimension (recourse quality) was designed, tested, and scoped out of the current implementation. Inter-rater agreement was unreliable at this stage.
AM1_topic_modelling/
├── config.yaml # All tunable parameters
├── sync_from_supabase.py # Pull data from Supabase to local CSVs
├── run_weekly.py # Weekly pipeline: sync → inference → write
├── run_monthly_drift.py # Monthly drift monitoring
├── Dockerfile # API deployment container
├── data/
│ ├── training/ # Training CSVs — all three countries (synced, gitignored)
│ ├── inference/ # Weekly inference CSVs (synced, gitignored)
│ └── evaluation_outputs/ # Coherence, stability, topic comparison CSVs
├── analysis/
│ ├── training/ # Model training (all countries)
│ │ ├── s01_data_loader.py # Load from CSV
│ │ ├── s02_cleaning.py # Structural text cleaning
│ │ ├── s03_spacy_processing.py # Lemmatisation, POS filtering, stopwords
│ │ ├── s04_vectorisation.py # TF-IDF
│ │ ├── s05_nmf_training.py # NMF model fitting
│ │ ├── s06_topic_allocation.py # Assign topics to documents
│ │ ├── s07_evaluation.py # Coherence + stability
│ │ ├── s08_save_outputs.py # Versioned run artifacts
│ │ ├── s09_mlflow_logging.py # Experiment tracking
│ │ ├── s10_pipeline.py # Training orchestrator
│ │ └── s11_supabase_writer.py # Write results to DB
│ ├── inference/ # Weekly + backfill inference
│ │ ├── batch_runner.py # NMF inference (all countries)
│ │ └── drift_monitor.py # Per-jurisdiction JS divergence
│ ├── api/ # FastAPI serving layer
│ │ ├── main.py
│ │ └── model_loader.py
│ └── dashboard/ # Streamlit application
│   ├── app.py # Overview
│   ├── supabase_loader.py # Data loading + caching
│   └── pages/
│     ├── 1_Topic_Explorer.py
│     ├── 2_Trends.py
│     ├── 3_Organisations.py
│     └── 4_Framing_Analysis.py
├── experiments/
│ ├── outputs/runs/ # Versioned model artifacts (gitignored)
│ ├── notebooks/ # Training + EDA notebooks
│ └── mlruns/ # MLflow experiment store (gitignored)
├── tests/
│ └── test_pipeline.py # Smoke tests
└── .github/workflows/
  ├── weekly_inference.yml # Saturday 8am: sync → inference
  └── monthly_drift.yml # 1st of month: drift monitoring
| Component | Technology |
|---|---|
| Topic modelling | scikit-learn NMF |
| NLP preprocessing | spaCy (en_core_web_sm) |
| Vectorisation | TF-IDF |
| API | FastAPI + Pydantic |
| Dashboard | Streamlit |
| Database | Supabase (PostgreSQL) |
| Experiment tracking | MLflow |
| CI/CD | GitHub Actions (weekly inference, monthly drift) |
| Deployment | Docker, Render |
Every modelling decision in this pipeline is a specification choice. The following are logged, surfaced on the dashboard, and tested under perturbation:
| Choice | Current setting | Why it matters |
|---|---|---|
| Training corpus | Per-country (England, Scotland, Ireland each trained separately) | Each country's topic space reflects its own vocabulary. |
| Preprocessing | spaCy en_core_web_sm | English-language model applied to Scottish/Irish policy text. |
| Model | NMF (baseline). BERTopic comparison planned. | Different models surface different topic structures. |
| Number of topics (k) | 30 (England), 15 (Scotland), 10 (Ireland) | Varied 5–50 in coherence sweep. k=25 and k=35 qualitatively reviewed. |
| TF-IDF parameters | min_df=3, max_df=0.85, max_features=3000 | Controls what vocabulary enters the model. |
| Topic naming | LLM-generated from top keywords | Reproducible but not neutral. |
| Source selection | 6 eng, 5 sco, 5 irl organisations | Who is in the corpus determines what the model finds. |
| Date range | Jan 2023 – present | A political choice, not a neutral setting. |
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python -m spacy download en_core_web_sm
python -m analysis.training.s10_pipeline
python -m analysis.inference.batch_runner
streamlit run analysis/dashboard/app.py
uvicorn analysis.api.main:app --reload
All parameters in config.yaml:
data:
  dataset_name: eng_training
  source: supabase
  training_country: eng
nmf:
  n_topics: 30
  random_state: 42
  init: nndsvd
  max_iter: 1000
tfidf:
  min_df: 3
  max_df: 0.85
  max_features: 3000
  ngram_range: [1, 2]
inference:
  model_type: nmf
  countries: [eng, sco, irl]
drift:
  cadence: monthly
  baseline: eng_training

The specification scoring layer is grounded in three independent theoretical traditions:
Proxy concentration derives from the construct validity literature. Jacobs and Wallach (2021) argue that fairness is a property of measurement models, not algorithms — the question of what a system is actually measuring is upstream of the question of whether it measures it fairly.
Specification sensitivity operationalises the replication crisis argument. Botvinik-Nezer et al. (Nature, 2020) showed that seventy independent teams given identical fMRI data reached different conclusions because of unlogged pipeline choices.
Normative divergence derives from the situated knowledge tradition. D'Ignazio and Klein's Data Feminism (2020) operationalises the argument that data systems encode the perspectives of their designers. Normative divergence makes that encoding computable.
In scope: Text-based policy corpora. Public artefacts. Design-time specification choices.
Out of scope: Image-based systems. Real-time decision pipelines without document corpora. Runtime specification logging for agentic systems. Recourse quality as a computable metric (scoped out of v1).
Known limitations:
- England dominates the corpus by volume. This is surfaced as a finding, not corrected.
- spaCy en_core_web_sm is an English-language model applied to all three jurisdictions. The vocabulary divergence is audited and reported.
- Topic labels are a specification choice. They are generated by LLM from top keywords, not ground truth.
- The pipeline finds patterns in language. It cannot determine whether those patterns reflect policy reality or corpus construction. Both possibilities are surfaced.
- BERTopic comparison — model swap robustness check using sentence-transformer embeddings. Tests whether cross-jurisdictional findings hold under a different modelling approach.
- KL divergence analysis — asymmetric divergence computation across all jurisdiction pairs to quantify normative dominance.
- Specification scoring layer — three computable dimensions (proxy concentration, specification sensitivity, normative divergence) extracted from pipeline outputs.
- RAG chatbot — LangGraph agent allowing stakeholders to query the dataset in natural language, with Langfuse observability and Inspect evaluation.
- Build Your Model — interactive dashboard page where stakeholders change parameters and observe how findings shift.
- Cross-domain application — testing the specification scoring approach in health, criminal justice, and immigration policy.
UCL Institute of Education | Level 6 AI Engineering Apprenticeship | 2025–2026