A geometric benchmark that isolates attention — selective filtering, sustained tracking, and adaptive updating — from raw reasoning ability.
The math at every step is trivial (vector addition, midpoints, rotation matrices). The challenge is maintaining an accurate internal representation across long, noisy, and disrupted sequences. Because every coordinate is defined by a chain of explicit operations, ground truth is deterministically computable — no LLM judge, no semantic ambiguity.
- Core Idea
- Three Attention Axes
- Scenario Generation
- Geometric Primitives
- Transforms
- Scoring
- Response Parsing
- Benchmark Tasks
- Project Structure
- Testing
- Running on Kaggle
A scenario is a sequence of statements (point definitions and transforms) followed by queries, presented to the model in natural language. Every point is defined by a chain of operations on existing points, so the dependency DAG is the ground truth. Distractors are provably distracting; the causal chain is provably load-bearing.
Four properties no text benchmark has:
- Verifiable closed-form ground truth — no LLM judge.
- Contamination-proof — fresh per seed, effectively infinite.
- Orthogonal difficulty knobs — one attention axis varies while others stay pinned.
- Graded scoring — Euclidean error binned into tiers, not binary.
Each axis maps to one parameter that is varied while everything else is randomized or pinned.
| Axis | Knob | What it tests | Cognitive analogue |
|---|---|---|---|
| Selective | num_points (noise) |
Filtering provably-irrelevant points out of the dependency chain | Cherry's cocktail-party effect (1953) |
| Sustained | max_chain_depth |
Maintaining a coherent numeric state through a long causal chain | SART vigilance decrement (Robertson 1997) |
| Shifting | transform_prob |
Detecting mid-scenario redefinitions and updating the cached representation | Task-switching cost (Monsell 2003) |
Scenarios are built by a single generative loop — no pre-planning, no separate "distractor" concept. At each step the generator either:
- Adds a point (relative to existing points)
- Applies a transform (rotates, translates, reflects, or scales a subset)
- Emits a query
Generation is fully deterministic from ScenarioConfig + seed. No randomness at evaluation time.
@dataclass
class ScenarioConfig:
dim: int = 3
min_chain_depth: int = 3
max_chain_depth: int = 7
leaf_bias: float = 0.5 # 1.0 = pure chain, 0.0 = bushy
num_points: int = 8 # selective-attention knob
transform_prob: float = 0.1 # shifting-attention knob
transform_types: list[str]
query_types: list[QueryType] # POSITION, DISTANCE, BOOLEAN
num_queries: int = 3
point_def_types: list[str]
coord_min/max, magnitude_min/max
query_target_depth | query_min_depth
seed: int | NoneScenarioConfig.randomize_except(rng, pinned={...}) generates diverse backgrounds while pinning only the variable under test. The same seed index produces the same randomized background across difficulty levels — so any score delta is attributable to the pinned axis, not background variance.
Phase 1 (depth ramp): Force linear growth until min_chain_depth is reached.
Phase 2 (generative): Add points (via leaf_bias) and transforms (via transform_prob).
Phase 3 (queries): Emit queries targeting deep / specific points.
Process the following 3D spatial reasoning scenario. Statements are
chronological — propagate all transformations before answering.
Format: [Answer q_ID] value — e.g. [Answer q_001] (0.0, 0.0, 0.0)
Point A is 1.8 units from Point O at angle 267 degrees.
Point B is at offset (-3.1, -1.9, 0.0) from Point A.
Point C is 5.5 units from Point B in the direction (-0.3, -0.9, 1.4).
Rotate Point B, Point C by 120 degrees around (0,0,0) on axis (1,0,0).
Point D is the midpoint of Point A, Point C.
[Query q_001] Position of D? (x, y, z)
All scenarios live in Cartesian 2D or 3D space. Origin O is the implicit zero vector. Angles are degrees in prompts, radians internally.
| Type | Example | Form |
|---|---|---|
cartesian_offset |
"B is at offset (3, -1, 2) from A." | target = anchor + offset |
magnitude_direction |
"B is 5.0 units from A in direction (1, 1, 0)." | target = anchor + m·normalize(d) |
magnitude_polar (2D) |
"C is 4.0 units from A at angle 60°." | target = anchor + m·[cos θ, sin θ] |
magnitude_spherical (3D) |
"C is 3.2 units from A at θ=45°, φ=30°." | target = anchor + m·spherical(θ, φ) |
midpoint |
"D is the midpoint of A, B, C." | target = mean(points) |
weighted_centroid |
"E is the centroid of A (w=2), B (w=1)." | target = weighted_mean(points, w) |
projection |
"F is the projection of C onto line AB." | `target = A + ((C-A)·(B-A) / |
The Space object holds a DAG of PointDefinitions and resolves any point to an absolute coordinate by topological walk. The engine is the single source of truth for ground truth.
Transforms mutate the geometric state mid-scenario. They are the primary mechanism for testing the shifting axis.
| Transform | Parameters |
|---|---|
rotation |
points, axis, center, angle |
translation |
points, displacement |
reflection |
points, plane_normal, plane_point |
scaling |
points, center, factor |
When a point is transformed, everything defined relative to it must update. The engine propagates through the dependency DAG; the model is expected to do the same.
Each query has a closed-form deterministic answer. Scoring is judge-free and depth-independent (so a model can't trade accuracy for depth).
| Tier | Threshold | Score |
|---|---|---|
| EXACT | error < 0.5 units | 1.0 |
| CLOSE | error < 2.0 units | 0.7 |
| APPROXIMATE | error < 5.0 units | 0.3 |
| WRONG | otherwise | 0.0 |
Absolute (not relative) error: a reasoning chain that drifts 5 units off should get the same penalty whether the target lives near the origin or far from it.
| Tier | Threshold | Score |
|---|---|---|
| EXACT | rel. error < 1% | 1.0 |
| CLOSE | rel. error < 5% | 0.7 |
| APPROXIMATE | rel. error < 15% | 0.3 |
| WRONG | otherwise | 0.0 |
rel_error = |pred - truth| / max(|truth|, ε) with ε = 1.0 to avoid blow-up near zero.
"Is X closer to Y or Z?" → 1.0 if exact match, 0.0 otherwise.
Unparseable model output → 0.0. Tracked separately so refusal can be distinguished from incorrect answers in diagnostic profiles.
Models are instructed to emit answers as:
[Answer q_001] (3.0, -1.5, 2.0)
[Answer q_002] 5.385
[Answer q_003] B
The parser is robust to chain-of-thought reasoning leaking into the response. Strategy, in order:
- Split by
[Answer q_ID]tags — extract the line after each tag. - Fall back to
[Query q_ID]tags if the answer count doesn't match. - Fall back to full text for each query.
Within each block, coordinate / distance / boolean parsers use the last regex match, not the first. This means intermediate scratch-work like "the midpoint is at (1, 2, 3) so..." is skipped and only the final value is captured.
Nine Kaggle benchmark notebooks in tasks/ — three attention axes × three difficulty levels. Each notebook evaluates a model across two pinned levels of its target axis with 10 seeds per level → 20 scenarios per task, 180 total. All tasks use POSITION queries targeting deep points.
Background: depth 5, transform_prob=0.1, dim=3. Noise multiplier sets num_points = round(depth × multiplier).
| Task | Noise multiplier | num_points | Distractor share |
|---|---|---|---|
selective_short |
1.0 – 1.5× | 5 – 8 | 0% – 33% |
selective_medium |
2.0 – 3.0× | 10 – 15 | 50% – 67% |
selective_long |
4.0 – 5.0× | 20 – 25 | 75% – 80% |
Background: noise 1.5×, transform_prob=0.1, dim=3. Queries are forced to depth ≥ depth − 2.
| Task | Depths | num_points |
|---|---|---|
sustained_short |
3 – 6 | round(depth × 1.5) |
sustained_medium |
9 – 12 | round(depth × 1.5) |
sustained_long |
15 – 18 | round(depth × 1.5) |
Background: depth 6, num_points=12, dim=3. Queries are placed after transforms, query_min_depth=4. Transforms cascade: rotating point D invalidates everything downstream.
| Task | Transform density |
|---|---|
shifting_short |
0.0 – 0.1 |
shifting_medium |
0.2 – 0.3 |
shifting_long |
0.4 – 0.5 |
| Axis | Tasks | Scenarios | Queries |
|---|---|---|---|
| Selective | 3 | 60 | ~180 |
| Sustained | 3 | 60 | ~180 |
| Shifting | 3 | 60 | ~180 |
| Total | 9 | 180 | ~540 |
zelusbench/
├── README.md
├── submission.md # Kaggle competition writeup
├── pyproject.toml
│
├── zelusbench/ # Core library (uploaded as Kaggle dataset)
│ ├── geometry/
│ │ ├── point.py # PointDefinition: 7 def types
│ │ ├── space.py # DAG, propagation, resolution
│ │ ├── transforms.py # rotation/translation/reflection/scaling
│ │ └── vectors.py # vec math utilities
│ │
│ ├── scenarios/
│ │ ├── config.py # ScenarioConfig + randomize_except
│ │ ├── generator.py # ScenarioGenerator: phased loop
│ │ └── templates.py # NL templates for statements/queries
│ │
│ └── evaluation/
│ ├── parser.py # Robust last-match extraction
│ ├── scorer.py # Tiered absolute/relative/binary scoring
│ └── reports.py # Diagnostic profiles
│
├── tasks/ # 9 Kaggle benchmark notebooks
│ ├── zelusbench_selective_{short,medium,long}.ipynb
│ ├── zelusbench_sustained_{short,medium,long}.ipynb
│ └── zelusbench_shifting_{short,medium,long}.ipynb
│
├── tests/ # 79 unit tests
│ ├── test_point.py # 21 tests — all point def types
│ ├── test_space.py # 12 tests — DAG, propagation
│ ├── test_transforms.py # 14 tests — all transforms
│ └── test_scenarios.py # 32 tests — generation, parsing, scoring
│
└── viz.ipynb # Visualisations (animated 3D scenarios)
Spaceis the single source of truth. All transforms mutate theSpace; ground truth comes from resolving the DAG. If the engine is correct, ground truth is correct.- Single generative loop. Each point, transform, and query is generated inline using only points that already exist. No forward references, no separate distractor concept.
- Continuous knobs.
leaf_bias,num_points,transform_probare continuous parameters, not discrete categories — enables smooth difficulty curves. - Deterministic from seed. Every scenario is reproducible from
ScenarioConfig + seed. - Background determinism across levels. Within a task, the same seed index produces the same randomized background regardless of pinned-knob value, so score deltas isolate the causal effect of the pinned axis.
- Depth-targeted queries. When testing sustained attention, queries land at the requested chain depth — not wherever the generator happens to place them.
79 unit tests covering the four engine layers. Run with:
uv run pytest| File | Tests | Covers |
|---|---|---|
test_point.py |
21 | Every point definition type resolves correctly given anchor positions; serialization round-trips. |
test_space.py |
12 | DAG construction, topological resolution, transform propagation through dependents, JSON serialization. |
test_transforms.py |
14 | Rotation (2D/3D, arbitrary axis), translation, reflection (line/plane), scaling (with center), composition. |
test_scenarios.py |
32 | Scenario generation, depth targeting, prompt rendering, response parsing (incl. CoT-leak resilience), tiered scoring. |
The engine is the only thing that needs to be correct — if Space resolves coordinates faithfully, ground truth is correct by construction. Tests focus on that invariant.
Each notebook in tasks/ is a standalone Kaggle Benchmarks task using the kaggle_benchmarks (kbench) library.
- Upload the
zelusbench/package as a Kaggle dataset. - Each notebook imports from the dataset and uses
@kbench.task+llm.prompt(). - Each notebook ends with
%choose <task_name>to register with the benchmark runner.
@kbench.task(name="zelusbench_sustained_long")
def zelusbench_sustained_long(llm) -> tuple[float, float]:
scores = []
for depth in [15, 18]:
for i in range(10):
bg_rng = random.Random(i * 7919) # deterministic background
cfg = ScenarioConfig.randomize_except(bg_rng, pinned={
"min_chain_depth": depth,
"max_chain_depth": depth,
"num_points": round(depth * 1.5),
"transform_prob": 0.1,
"dim": 3,
"query_min_depth": depth - 2,
"query_types": [QueryType.POSITION],
"seed": depth * 1000 + i,
})
scenario = ScenarioGenerator(cfg).generate(scenario_id=f"d{depth}_s{i}")
response = llm.prompt(scenario.prompt)
scores.extend(score_response(response, scenario))
mean = np.mean([s.score for s in scores])
sem = np.std([s.score for s in scores]) / np.sqrt(len(scores))
return mean, sem
zelusbench_sustained_long.run(llm=kbench.llm)Scoring is local and deterministic — no judge model, no network round-trip per query.