SAE-based hallucination detection and mitigation for LLMs.

ariaxhan/neural-polygraph

Neural Polygraph

Detecting hallucinations in language models using Sparse Autoencoder (SAE) spectral signatures and geometric analysis.

Quick Start

# 1. Install
pip install -e .

# 2. Verify (quick check)
python verify_setup.py

# 3. Run experiment
python run_experiment.py 01_spectroscopy

# 4. Visualize
python experiments/visualize_spectroscopy.py

Note: Use python test_setup.py for a comprehensive verification of all dependencies.

Structure

neural-polygraph/
├── src/hallucination_detector/    # Core package
│   ├── sae_utils.py                # SAE feature extraction
│   ├── geometry.py                 # Geometric analysis
│   ├── data_loader.py              # HB-1000 benchmark loader
│   └── storage.py                  # Experiment storage
│
├── experiments/                    # Experiment protocols
│   ├── 01_spectroscopy.py          # Experiment A
│   ├── visualize_spectroscopy.py   # Visualization
│   └── data/                       # HB-1000 benchmark (~1000 samples)
│
├── run_experiment.py               # Universal runner
├── test_setup.py                   # Setup verification
└── TESTING-PLANS.MD                # Research plan

Usage

Run Experiments

# List available experiments
python run_experiment.py --list

# Run Experiment A: Spectroscopy
python run_experiment.py 01_spectroscopy

# View results
python run_experiment.py --view 01_spectroscopy

Programmatic Usage

from pathlib import Path

from hallucination_detector import (
    HB_Benchmark,
    ExperimentStorage,
    compute_inertia_tensor,
)

# Load benchmark data, then the model and SAE
benchmark = HB_Benchmark("experiments/data")
benchmark.load_datasets()
benchmark.load_model_and_sae(layer=5, width="16k")

# Get SAE activations for a single prompt
activations = benchmark.get_activations("The Eiffel Tower is in Paris")
print(f"L0 Norm: {activations.l0_norm}")
print(f"Reconstruction Error: {activations.reconstruction_error:.4f}")

# Save results
storage = ExperimentStorage(Path("experiments/my_experiment"))
storage.write_manifest({"experiment": "my_experiment"})
storage.write_metrics({"metric": [...]})

Experiments

Experiment A: Spectroscopy ✅

Goal: Demonstrate distinct spectral signatures of hallucinations

Metrics: L0 Norm, Reconstruction Error, Gini Coefficient

Run: python run_experiment.py 01_spectroscopy
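
For readers unfamiliar with the three metrics, here is a minimal NumPy sketch of how they could be computed from a vector of SAE feature activations. It is not the code used in 01_spectroscopy.py. The function names are hypothetical, and the choice of mean squared error for reconstruction error is an assumption.

```python
import numpy as np


def l0_norm(acts: np.ndarray) -> int:
    """Count the non-zero SAE feature activations."""
    return int(np.count_nonzero(acts))


def reconstruction_error(original: np.ndarray, reconstructed: np.ndarray) -> float:
    """Mean squared error between an activation and its SAE reconstruction.

    Assumption: MSE is used here for illustration; the experiment may
    define reconstruction error differently.
    """
    return float(np.mean((original - reconstructed) ** 2))


def gini(acts: np.ndarray) -> float:
    """Gini coefficient of activation magnitudes.

    0 means activation mass is spread evenly over features; values near 1
    mean it is concentrated in a few features.
    """
    x = np.sort(np.abs(acts).astype(float))  # ascending order
    n = x.size
    total = x.sum()
    if n == 0 or total == 0:
        return 0.0
    ranks = np.arange(1, n + 1)
    return float(2 * np.sum(ranks * x) / (n * total) - (n + 1) / n)


# Toy activation vector with a single active feature
acts = np.array([0.0, 0.0, 0.0, 1.0])
print(l0_norm(acts), gini(acts))
```

With one active feature out of four, the Gini coefficient reaches its maximum for that length, (n - 1) / n.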

Experiment B: Geometry 🚧

Goal: Measure the "shape" of thoughts using inertia tensors

Status: Coming soon
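
As a rough illustration of the idea, the sketch below computes the classical inertia tensor of a weighted point cloud about its weighted centroid. The function name is hypothetical, and so is the mapping onto SAE features (points as feature directions, weights as activation strengths). It is not the package's compute_inertia_tensor, whose signature is not shown here.

```python
import numpy as np


def inertia_tensor_sketch(points, weights) -> np.ndarray:
    """Inertia tensor I = sum_i w_i * (|r_i|^2 * Id - r_i r_i^T).

    points:  (n, d) array of positions (hypothetically, feature directions)
    weights: (n,) array of masses (hypothetically, activation strengths);
             assumed to have a non-zero sum
    """
    points = np.asarray(points, dtype=float)
    weights = np.asarray(weights, dtype=float)
    centroid = np.average(points, axis=0, weights=weights)
    r = points - centroid                      # positions relative to centroid
    sq_norms = np.einsum("ij,ij->i", r, r)     # |r_i|^2 for each point
    d = points.shape[1]
    isotropic = np.dot(weights, sq_norms) * np.eye(d)
    outer = np.einsum("i,ij,ik->jk", weights, r, r)
    return isotropic - outer


# Two unit masses on the x-axis: no inertia about x, equal inertia about y and z
tensor = inertia_tensor_sketch([[1, 0, 0], [-1, 0, 0]], [1.0, 1.0])
print(np.linalg.eigvalsh(tensor))
```

The eigenvalues of the tensor (the principal moments) summarize how elongated or spherical the cloud is, which is one way to read the "shape" described in the goal.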

Experiment C: Ghost Features 🚧

Goal: Identify features unique to hallucinations

Status: Coming soon
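
One simple way to read "features unique to hallucinations" is as a set difference between the features active on the hallucinated statement and those active on its factual counterpart. The sketch below assumes that reading. The function name and the activation threshold are hypothetical and not part of the package.

```python
def ghost_features(fact_acts, hallucination_acts, threshold=0.0):
    """Return indices of features active only in the hallucinated sample.

    A feature counts as active when its activation exceeds `threshold`
    (an illustrative assumption).
    """
    fact_active = {i for i, a in enumerate(fact_acts) if a > threshold}
    hall_active = {i for i, a in enumerate(hallucination_acts) if a > threshold}
    return sorted(hall_active - fact_active)


# Toy activations for one fact/hallucination pair
fact = [0.5, 0.0, 0.2, 0.0]
hallucination = [0.4, 0.9, 0.0, 0.1]
print(ghost_features(fact, hallucination))  # [1, 3]
```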

Data: HB-1000 Benchmark

Dataset              Samples   Description
Entity Swaps         230       Geographic/entity errors
Temporal Shifts      270       Temporal errors
Logical Inversions   250       Logical flips
Adversarial Traps    250       High-probability misconceptions

Total: ~1,000 fact/hallucination pairs in experiments/data/

Dependencies

Core: torch, transformer-lens, sae-lens, numpy, polars

Viz: matplotlib, seaborn, plotly

Analysis: scikit-learn, umap-learn

See pyproject.toml for complete list.

Troubleshooting

Import errors: pip install -e .

Memory issues: Use CPU mode or smaller batches

Model download: Models are downloaded from Hugging Face (~2 GB)

Test setup: python test_setup.py verifies all dependencies

Research Plan

See TESTING-PLANS.MD for detailed experimental protocols and hypotheses.

License

MIT License
