tmcarmichael/nn-observability

Architectural Observability Collapse in Transformers

Read the paper (arXiv)

Transformer activations carry information about which tokens will be wrong that output confidence does not expose. Whether this signal exists at all depends on which model you deploy. Training can erase it while the model keeps getting better at its task.

A frozen linear probe, one dot product per token, reads this signal with no fine-tuning and no task-specific data. A probe trained on Wikipedia catches the same errors zero-shot on medical licensing questions and reading comprehension. Standard probing methodology overstates the signal by a factor of two to three: confidence controls absorb 60.3% of the raw probe signal across 14 models in 6 families. After controlling for confidence, the surviving signal is stable across 20 seeds and output-independent. A trained MLP on last-layer activations does not recover it.
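As a minimal sketch of what "one dot product per token" means: given hidden activations of shape (tokens, hidden size), a frozen linear probe scores each token with a single weight vector. The function name `probe_scores`, the toy shapes, and the random weights below are illustrative assumptions, not the repository's trained probe or API.

```python
import numpy as np


def probe_scores(hidden: np.ndarray, w: np.ndarray, b: float = 0.0) -> np.ndarray:
    """Score every token with a frozen linear probe: one dot product per token."""
    return hidden @ w + b


rng = np.random.default_rng(0)
hidden = rng.normal(size=(5, 8))  # toy activations: 5 tokens, hidden size 8
w = rng.normal(size=8)            # placeholder probe direction (not a trained probe)

scores = probe_scores(hidden, w)
print(scores.shape)  # (5,)
```

Because the weights stay fixed at inference time, the per-token cost is one vector product on activations the model has already computed.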

Observability collapse

Whether this signal exists in a given model is determined before deployment. Under Pythia's controlled training, both matched-width configurations form the signal at the earliest measured checkpoint. Training then erases it in the (24L, 16H) class while perplexity improves monotonically in both configurations through the collapse. Architecture does not prevent the signal from appearing. It determines whether training preserves or erases it.

The result is observability collapse: the decision-quality signal that neither confidence nor output-layer predictors recover falls to the detection floor at every measured layer. The collapse survives the standard escape hatches: it is not layer choice, probe nonlinearity, underpowered training, or final-layer predictor capacity. Six other Pythia configurations stay healthy across a 170x parameter range.

The pattern replicates across families and training recipes. At matched 3B scale, Qwen and Llama differ by 2.9x with non-overlapping seed distributions. Mistral 7B preserves the signal where Llama 3.1 8B collapses despite similar architecture. The collapse map changes across recipes, but the phenomenon persists. Family membership explains 91% of variance at p = 0.003.
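A family effect of this kind can be tested with a permutation test on a one-way F statistic. The sketch below shows the general technique only; it is not the code in analysis/permutation_test.py, and the per-model values and family labels are made-up toy numbers.

```python
import numpy as np


def f_statistic(values, labels):
    """One-way ANOVA F: between-group mean square over within-group mean square."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    groups = [values[labels == g] for g in np.unique(labels)]
    grand_mean = values.mean()
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


def permutation_p_value(values, labels, n_perm=2000, seed=0):
    """Shuffle family labels and count how often the shuffled F reaches the observed F."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    observed = f_statistic(values, labels)
    hits = sum(
        f_statistic(values, rng.permutation(labels)) >= observed
        for _ in range(n_perm)
    )
    # The +1 terms count the observed labeling itself, so p is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


# Toy data: three hypothetical families with four models each (made-up scores).
values = [0.30, 0.32, 0.28, 0.31, 0.10, 0.12, 0.09, 0.11, 0.20, 0.22, 0.19, 0.21]
labels = ["A"] * 4 + ["B"] * 4 + ["C"] * 4

f_obs, p_value = permutation_p_value(values, labels)
print(f_obs, p_value)
```

The permutation null makes no distributional assumption about the scores, which suits a small number of models per family.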

Implications

Monitorability has a ceiling set during training. A probe trained on Wikipedia, with no task-specific data, transfers zero-shot to SQuAD, MedQA, and TruthfulQA. It exclusively catches 10.9-13.4% of all errors at 20% flag rate, errors that confidence marks correct, across seven of nine downstream model-task cells. When observability collapses, no post-hoc probe design recovers healthy-range signal. Architecture selection is a monitoring decision.
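One way to read "exclusive catch rate at 20% flag rate" is sketched below: flag the 20% of tokens with the highest probe score, separately flag the 20% with the lowest confidence, and count the errors that only the probe flags as a share of all errors. This definition is an assumption made for illustration; analysis/exclusive_catch_rates.py is the authoritative implementation. All inputs here are toy values.

```python
import numpy as np


def top_fraction_mask(scores, rate):
    """Boolean mask selecting the `rate` fraction of items with the highest scores."""
    scores = np.asarray(scores, dtype=float)
    k = int(round(rate * len(scores)))
    mask = np.zeros(len(scores), dtype=bool)
    if k > 0:
        mask[np.argsort(-scores, kind="mergesort")[:k]] = True
    return mask


def exclusive_catch_rate(probe, confidence, is_error, rate=0.20):
    """Share of all errors flagged by the probe but not by the low-confidence flag."""
    probe_flag = top_fraction_mask(probe, rate)
    conf_flag = top_fraction_mask(-np.asarray(confidence, dtype=float), rate)
    is_error = np.asarray(is_error, dtype=bool)
    exclusive = probe_flag & ~conf_flag & is_error
    return float(exclusive.sum() / max(is_error.sum(), 1))


# Toy example with 10 tokens: the probe flags tokens 0 and 1, confidence flags 1 and 2.
probe = [0.9, 0.8, 0.1, 0.2, 0.3, 0.0, 0.4, 0.5, 0.6, 0.7]
confidence = [0.95, 0.10, 0.20, 0.90, 0.80, 0.99, 0.70, 0.60, 0.85, 0.75]
is_error = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]

rate = exclusive_catch_rate(probe, confidence, is_error)
print(rate)  # 0.25: of four errors, only token 0 is caught by the probe alone
```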

This ceiling is invisible to standard evaluation. Raw probes can confuse confidence with decision quality. Output confidence is a lossy interface: it exposes a prediction and a score, but discards internal evidence about whether that prediction is fragile. Access to activations is not the same as access to useful internal evidence. A white-box model can still be unobservable if training failed to preserve the relevant signal. Predictive capability can improve while monitorability is destroyed. Model selection must evaluate what a model can do and what internal evidence it preserves for oversight. Observability becomes an evaluation dimension alongside accuracy, latency, cost, and calibration.

Representation geometry as a design target

The observable signal occupies a low-variance direction in representation space, nearly orthogonal to the dominant variance axes. The erasure is selective: some architecture-recipe configurations systematically push representation geometry toward structures where that direction cannot survive. These results turn representation geometry from a passive diagnostic object into an upstream design target that mediates tradeoffs between capability, interpretability, and monitorability. Architecture sets a geometric prior, training optimizes it, probing measures it, monitoring reads it out. Internal representation geometry becomes a first-class design variable, alongside loss, architecture, and data.
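To make "low-variance direction, nearly orthogonal to the dominant variance axes" concrete, the sketch below measures two quantities for a given direction: the share of total activation variance along it, and its largest absolute cosine with the top principal axes. The helper name `direction_geometry` and the synthetic activations are assumptions for illustration, not the paper's analysis code.

```python
import numpy as np


def direction_geometry(acts, direction, k=2):
    """Return (variance share along `direction`, max |cosine| with the top-k principal axes)."""
    centered = acts - acts.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    d = direction / np.linalg.norm(direction)
    var_share = float(np.var(centered @ d) / np.var(centered, axis=0).sum())
    max_cos = float(np.abs(vt[:k] @ d).max())
    return var_share, max_cos


rng = np.random.default_rng(0)
# Synthetic activations: two dominant axes, five ordinary ones, one low-variance axis.
scales = np.array([10.0, 5.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.1])
acts = rng.normal(size=(1000, 8)) * scales

direction = np.zeros(8)
direction[7] = 1.0  # stand-in for a probe direction on the low-variance axis

var_share, max_cos = direction_geometry(acts, direction)
print(var_share, max_cos)
```

In this toy setup both numbers come out small: the direction carries almost none of the variance and barely overlaps the dominant axes, which is the geometric situation the paragraph describes.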

Two panels showing observability collapse in two training recipes. Left, Llama: 1B rises to +0.28, while 3B and 8B stay flat near +0.05 to +0.10 across all layers. Right, Pythia: six sizes peak between +0.20 and +0.38, while 410M and 1.4B (both 24 layers, 16 heads) stay flat near +0.10.

Both panels use the same protocol, the same token budget per hidden dimension, and the same shaded detection band. Left panel: Llama 3.2 under a cross-recipe split, where 1B preserves the signal and 3B and 8B do not. Right panel: Pythia under held-recipe training, where three of nine configurations collapse, all sharing 24 layers and 16 heads. The replication spans a 3.5x parameter gap, two Pile variants, and two hidden dimensions. No intermediate values appear.

Scatter of 26 models showing output-controlled residual on the y-axis against confidence-controlled partial correlation on the x-axis. Bootstrap linear fit slope ≈0.80, with collapse points near the origin for Llama 3B and 8B and three Pythia (24L, 16H) configurations.

26 models, seven families. The x-axis is pcorr (partial Spearman correlation between probe scores and per-token loss, controlling for confidence and activation norm). The y-axis is the output-controlled residual: what remains after also controlling for a trained last-layer predictor. Collapse points cluster near the origin. Where pcorr collapses, the surplus over output-side prediction vanishes with it.
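The pcorr definition above can be sketched as a rank-based partial correlation: rank-transform every variable, regress the ranked controls out of both the probe scores and the per-token loss, and correlate the residuals. This is a generic illustration under that assumption, run on synthetic data; the repository's own implementation may differ in detail.

```python
import numpy as np


def average_ranks(x):
    """Rank values 1..n, giving tied values their mean rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(1, len(x) + 1)
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    return (np.bincount(inverse, weights=ranks) / counts)[inverse]


def partial_spearman(x, y, covariates):
    """Correlate the ranks of x and y after regressing ranked covariates out of both."""
    rx, ry = average_ranks(x), average_ranks(y)
    z = np.column_stack([np.ones(len(rx))] + [average_ranks(c) for c in covariates])
    res_x = rx - z @ np.linalg.lstsq(z, rx, rcond=None)[0]
    res_y = ry - z @ np.linalg.lstsq(z, ry, rcond=None)[0]
    return float(np.corrcoef(res_x, res_y)[0, 1])


rng = np.random.default_rng(0)
n = 2000
confidence = rng.normal(size=n)
act_norm = rng.normal(size=n)
hidden_signal = rng.normal(size=n)  # information that confidence does not expose
loss = -confidence + hidden_signal + 0.5 * rng.normal(size=n)

# A probe that reads the hidden signal keeps its correlation under the controls.
probe = hidden_signal + 0.5 * rng.normal(size=n)
pcorr = partial_spearman(probe, loss, [confidence, act_norm])

# A probe that only re-encodes confidence looks good raw but vanishes once controlled.
confound_probe = -confidence + 0.5 * rng.normal(size=n)
confound_raw = partial_spearman(confound_probe, loss, [])
confound_pcorr = partial_spearman(confound_probe, loss, [confidence, act_norm])

print(pcorr, confound_raw, confound_pcorr)
```

The second probe shows how raw probing can overstate the signal: its uncontrolled correlation is substantial, yet almost nothing survives the confidence control. Adding a last-layer predictor's output to the covariate list would give an analogue of the output-controlled residual.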

What this repo contains

The code, data, and analysis behind the paper. Every paper-cited number traces to a committed JSON. reports/paper_values.json enumerates every macro the paper text uses; the headline values and the directly-readable subset carry full source_files + key_paths + formula annotations, and annotation coverage is locked to grow rather than regress. Every result JSON validates against a formal Draft 2020-12 schema in schema/. Every model revision in results/model_revisions.json is SHA-verified against the Hugging Face API. The full result-file inventory is published as a Croissant 1.1 metadata descriptor at croissant.json (validated against the official MLCommons spec).

git clone https://github.com/tmcarmichael/nn-observability
cd nn-observability
uv sync

uv run pytest tests/ -q
uv run python analysis/run_all.py

Verify a paper claim

Three independent paths, in increasing depth:

Path A: structured claim provenance. reports/paper_values.json carries every macro the paper cites. Every annotated entry includes source_files, key_paths, formula, and scope. The annotated subset includes every headline value the paper text references; the live counts (n_macros, n_macros_with_provenance) are at the top of the JSON, and the count cannot regress. Pick any annotated macro and walk the chain by hand:

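import json
from pathlib import Path

import numpy as np

# Look up the macro by name in the claim-provenance file.
pv = json.loads(Path("reports/paper_values.json").read_text())
macro = next(m for m in pv["macros"] if m["name"] == "confabsorbmean")

# For each source file, compute the relative drop (in percent) from the
# "none" entry to the "standard" entry of its control_sensitivity block.
deltas = []
for fname in macro["source_files"]:
    result = json.loads(Path("results", fname).read_text())
    cs = result["control_sensitivity"]
    deltas.append((cs["none"] - cs["standard"]) / cs["none"] * 100)

# Mean drop across the macro's source files, in percent.
print(np.mean(deltas))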

reports/scopes.json carries the named scopes (cross_family_14, pythia_controlled_9, etc.; membership is mirrored from analysis/load_results.py:SCOPES and locked by a drift test). reports/figure_sources.json maps each PDF figure to its source JSONs. schema/ holds Draft 2020-12 JSON Schemas for every result type, dispatched by filename pattern in scripts/validate_schemas.py:DISPATCH. tests/test_paper_values.py enforces the bridge contract end-to-end: every source_files entry exists, every key_paths resolves, every direct-read macro matches its JSON cell at the formatted precision, every result JSON validates against its dispatched schema, exporters are idempotent, paper_version matches main.tex, and macro coverage cannot regress below the locked floor.
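The "every key_paths resolves" check can be pictured as walking a path through a nested JSON document. The sketch below assumes a dotted-path syntax with integer segments for list indices; that syntax, the helper name `resolve`, and the toy document are hypothetical, and the real key_paths format in reports/paper_values.json may differ.

```python
from functools import reduce


def resolve(document, dotted_path):
    """Follow a dotted path through nested dicts and lists (hypothetical path syntax)."""
    def step(node, key):
        if isinstance(node, list):
            return node[int(key)]  # integer segments index into lists
        return node[key]           # a missing key raises KeyError
    return reduce(step, dotted_path.split("."), document)


# Toy result document with made-up values.
result = {
    "control_sensitivity": {"none": 0.5, "standard": 0.2},
    "seeds": [{"pcorr": 0.28}],
}

print(resolve(result, "control_sensitivity.none"))  # 0.5
print(resolve(result, "seeds.0.pcorr"))             # 0.28
```

A contract test in this style fails loudly when a path stops resolving, instead of letting a stale number reach the paper unnoticed.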

Path B: targeted CLI verification. Pick a claim and run an analysis script:

| Paper claim | Value | Command | Source |
| --- | --- | --- | --- |
| Cross-family permutation F (family effect) | p = 0.003 | uv run python analysis/permutation_test.py | cross_family_14 scope in analysis/load_results.py |
| Llama 1B partial correlation | +0.286 | uv run python analysis/load_results.py | results/llama-3.2-1b_main.json |
| Exclusive catch rate at 20% flag rate (LM) | 12-15% | uv run python analysis/exclusive_catch_rates.py | results/transformer_observe.json key 6a |

Path C: full pipeline. uv run pytest tests/ -q runs the test suite (formal schema validation across every result type, scope membership, paper_values.json integrity, direct-read auto-verification, exporter idempotency, figure-source existence, manifest revision pinning across every model-loading script, and a Python compile gate for the script set). uv run python scripts/validate_schemas.py --strict validates every result JSON against its dispatched schema and exits non-zero on any unmatched file. On the paper side, the just check recipe additionally diffs the content of every generated artifact. CI runs both commands on every push.

Reproduce an experiment

  1. Clone and sync: git clone https://github.com/tmcarmichael/nn-observability && cd nn-observability && uv sync --extra transformer

  2. Pull the canonical CUDA environment: docker pull runpod/pytorch:1.0.2-cu1281-torch280-ubuntu2404 (PyTorch 2.8.0, CUDA 12.8.1, Ubuntu 24.04). VRAM should be sized for the target model.

  3. Run the full protocol on any Hugging Face model:

    uv run python scripts/run_model.py --model Qwen/Qwen2.5-7B --output qwen2.5-7b_main.json

    For the full Pythia controlled suite, run just pythia-suite (all configurations, sequential). Output is a self-contained JSON per model with full provenance: layer sweep, 7-seed evaluation, output-controlled residual, cross-domain transfer, control sensitivity, and flagging analysis.

    The manifests results/model_revisions.json and results/dataset_revisions.json pin Hugging Face model and eval-dataset SHAs and must be present. Every entry in model_revisions.json is programmatically verified against the Hugging Face API; the latest report under results/manifest_verification/ records the verification timestamp and exact-SHA-match status per entry, and is regeneratable via uv run --extra transformer python scripts/verify_manifest_revisions.py.

    For strict reproduction, run with HF_HOME=$(mktemp -d) so cached versions cannot override revision=; this isolates the run from your personal Hugging Face cache.

    Pile is the upstream training corpus for Pythia and Pythia-deduped and is not a reproduction dependency. Reviewers do not need to acquire Pile. Probing data is WikiText, pinned in dataset_revisions.json.

  4. Schema-validate the new JSON against the recorded protocol: just validate-results-strict.

To add a new model to the analysis scope, see analysis/README.md. Local development without a GPU is sufficient for uv run pytest tests/ and CPU analysis (analysis/run_all.py, analysis/exclusive_catch_rates.py).

Croissant metadata

A Croissant 1.1 metadata descriptor at croissant.json covers the full results inventory: a parent archive plus the model and dataset revision manifests and the latest verification report as cr:FileObject entries, one cr:FileSet per result file type (glob-matched against results/), and a cr:RecordSet per type with fields derived from schema/*.schema.json. The descriptor is regenerated by just croissant and gated by just check-croissant, which runs the official MLCommons validator. Spec: https://docs.mlcommons.org/croissant/docs/croissant-spec-1.1.html.

The parent FileObject's sha256 is a deterministic merkle hash over (filename, sha256) pairs of every distribution file. To verify dataset integrity, clone at the cited tag and rerun the generator: the regenerated croissant.json must match the committed file byte-for-byte.
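A deterministic digest over (filename, sha256) pairs can be sketched as follows. The sort order, the tab-and-newline record format, and the helper names are assumptions for illustration; the repository's generator defines the actual scheme, so this sketch will not reproduce the committed hash.

```python
import hashlib


def file_sha256(data: bytes) -> str:
    """Hex SHA-256 of one file's bytes."""
    return hashlib.sha256(data).hexdigest()


def parent_digest(pairs):
    """Hash (filename, sha256) pairs in sorted order so input order does not matter."""
    h = hashlib.sha256()
    for name, digest in sorted(pairs):
        h.update(f"{name}\t{digest}\n".encode("utf-8"))
    return h.hexdigest()


# Toy "distribution files" held in memory (made-up names and contents).
files = {"a.json": b'{"x": 1}', "b.json": b'{"y": 2}'}
pairs = [(name, file_sha256(data)) for name, data in files.items()]

root = parent_digest(pairs)
print(root)
```

Since the parent digest depends only on names and per-file hashes, changing a single byte in any file changes the parent hash, while the order in which files are listed does not.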

Repository structure

src/        Core library (probe, observer, experiment engine).
scripts/    GPU experiment launchers, schema validator, exporters.
analysis/   CPU statistical analysis.
schema/     Draft 2020-12 JSON Schemas, one per result type.
results/    Result JSONs, manifests, verification reports.
reports/    Cross-repo claim provenance (paper_values, scopes, figure_sources).
tests/      Schema, drift, integrity, and contract gates.
assets/     Paper figures and share-ready PNGs.

Citation

Cite the paper and the code separately:

@misc{carmichael2026observability,
  title={Architectural Observability Collapse in Transformers},
  author={Carmichael, Thomas},
  year={2026},
  eprint={2604.24801},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  doi={10.48550/arXiv.2604.24801},
  url={https://arxiv.org/abs/2604.24801}
}

@software{carmichael2026code,
  title={nn-observability: code for ``Architectural Observability Collapse in Transformers''},
  author={Carmichael, Thomas},
  year={2026},
  doi={10.5281/zenodo.19435674},
  url={https://github.com/tmcarmichael/nn-observability}
}

License

MIT License
