A Harbor Framework benchmark for evaluating AI agents on formal theorem proving in Lean 4 with Mathlib 4.27.
The theorems are drawn from research on generalized bent functions and Butson-Hadamard matrices conducted at University College Dublin (UCD). They trace a path through the algebraic and combinatorial machinery surrounding the Hadamard matrix conjecture: from foundational results on roots of unity and character orthogonality, through Walsh-Hadamard analysis and Parseval's identity, to properties of Butson-type matrices and generalized bent/plateaued functions over finite abelian groups.
| # | Task | Difficulty | Description |
|---|---|---|---|
| 1 | pythagorean-theorem |
Easy | Prove the Pythagorean theorem in abstract inner product spaces: if two vectors are orthogonal then ‖x + y‖² = ‖x‖² + ‖y‖². Smoke test. |
| 2 | cocycle-ext |
Easy | Define a 2-cocycle ψ : G x G → U satisfying the cocycle condition, then prove extensionality: two cocycles are equal iff they agree on all pairs. |
| 3 | sqrt-two-pow-odd-irrational |
Medium | Prove that √(2^n) is irrational whenever n is odd, by reducing to the irrationality of √2. |
| 4 | gbf-is-plateaued |
Medium | Prove that every generalized bent function f : (ZMod q)^m → ZMod h is also a generalized plateaued function. |
| 5 | submatrix-bijective |
Medium | Prove that the Butson-Hadamard property (entries are k-th roots of unity, M M* = nI) is preserved and reflected under bijective row/column reindexing. |
| 6 | sum-neg-one-pow-inner-product |
Hard | Prove character orthogonality over (ZMod 2)^n: the sum ∑ᵤ(-1)^⟨u,v⟩ equals 2^n when v = 0, and 0 otherwise. |
| 7 | vanishing-roots-unity |
Hard | For a prime p and primitive p-th root ζ in a characteristic-zero field, prove that ∑ αᵢζⁱ = 0 iff all coefficients αᵢ are equal. |
| 8 | fourier-matrix-is-butson |
Hard | Prove that the n x n Fourier matrix with entries ζₙ^{ij} is a Butson-Hadamard matrix BH(n,n): entries are n-th roots of unity and F F* = nI. |
| 9 | bent-requires-even |
Hard | Prove that classical bent functions can only exist in even dimensions, via Parseval's identity and an irrationality argument. |
| 10 | lemma2 |
Hard | Prove the discrete Wiener-Khinchin theorem: the autocorrelation vector times the Fourier matrix D^m equals the vector of squared magnitudes of the character sums. |
- Docker
- Harbor CLI (
pip install harbor) - Anthropic API key (for Claude agents)
harbor run -p harbor-dataset/lean-theorems -a oracleexport ANTHROPIC_API_KEY="your-api-key"
harbor run -p harbor-dataset/lean-theorems \
-a claude-code \
-m claude-sonnet-4-20250514 \
--ek ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY \
--timeout-multiplier 8 \
--override-memory-mb 8192 \
-n 1harbor run -p harbor-dataset/lean-theorems/pythagorean-theorem -a oracleclaude-sonnet-4-20250514- Claude Sonnet 4claude-opus-4-20250514- Claude Opus 4
| Agent | Model | Score | Notes |
|---|---|---|---|
| Oracle | - | 10/10 (1.000) | Reference solutions |
| Claude Code | claude-sonnet-4-20250514 | 3/10 (0.300) | See breakdown below |
| # | Task | Difficulty | Reward | Agent Time | Total Time | Input Tokens | Output Tokens | Failure Mode |
|---|---|---|---|---|---|---|---|---|
| 1 | pythagorean-theorem |
Easy | 1.0 | 12m 31s | 30m 37s | 18,523 | 5 | - |
| 2 | cocycle-ext |
Easy | 1.0 | 2m 10s | 10m 04s | 1,151,065 | 183 | - |
| 3 | sqrt-two-pow-odd-irrational |
Medium | 0.0 | 10m 46s | 15m 59s | 4,940,915 | 385 | sorry left in proof |
| 4 | gbf-is-plateaued |
Medium | 1.0 | 5m 25s | 17m 59s | 1,763,710 | 419 | - |
| 5 | submatrix-bijective |
Medium | 0.0 | 8m 44s | 13m 45s | 5,867,616 | 948 | Unsolved goals; failed rewrite tactics |
| 6 | sum-neg-one-pow-inner-product |
Hard | 0.0 | 16m 30s | 26m 38s | 6,897,957 | 961 | sorry left in proof |
| 7 | vanishing-roots-unity |
Hard | 0.0 | 10m 20s | 17m 20s | - | - | Syntax error + unsolved goals in both iff directions |
| 8 | fourier-matrix-is-butson |
Hard | 0.0 | 12m 10s | 15m 08s | 6,570,920 | 1,581 | sorry left in proof |
| 9 | bent-requires-even |
Hard | 0.0 | 7m 27s | 25m 03s | 1,312,487 | 212 | sorry left in proof |
| 10 | lemma2 |
Hard | 0.0 | 8m 05s | 23m 25s | 1,779,895 | 497 | Redefined types trivially (false positive) |
| Parameter | Value |
|---|---|
| Agent | claude-code |
| Model | claude-sonnet-4-20250514 |
| Timeout Multiplier | 8x |
| Docker Memory | 8192 MB |
Concurrency (-n) |
1 (sequential) |
| Total Duration | ~3.3 hours |
| Total Input Tokens | ~30.3M |
| Total Output Tokens | ~5.2K |
| Estimated Cost | ~$100 |
These tasks exceeded the agent's ability to navigate Mathlib's API within the time limit. The agent made partial progress but left sorry placeholders in unfinished sub-goals:
sqrt-two-pow-odd-irrational(Medium): The agent needed to connectNat.sqrtirrationality toReal.sqrtand reduce√(2^n)for oddnto√2. It could not find the right Mathlib lemmas to bridge these.sum-neg-one-pow-inner-product(Hard): Character orthogonality over(ZMod 2)^nrequires manipulatingFinset.sumover a product type and case-splitting on whether the vector is zero. The agent could not close the combinatorial argument.fourier-matrix-is-butson(Hard): ProvingF F* = nIfor the Fourier matrix requires showing orthogonality of rows via geometric series sums of roots of unity. The agent leftsorryin the matrix multiplication proof.bent-requires-even(Hard): The proof requires Parseval's identity to show∑|Wf(w)|² = 2^n · 2^n, then deriving that2^(n/2)must be rational, which forcesneven. The agent could not formalize the irrationality step.
submatrix-bijective(Medium): The agent attempted to useEquiv.ofBijectiveto reindex the matrix but could not unify the types. Lean reported "Tacticrewritefailed: Did not find an occurrence of the pattern" at multiple locations. The proof had unsolved goals around matrix submatrix operations.vanishing-roots-unity(Hard): The agent produced a syntax error at line 21 (unexpected token 'in') and left unsolved goals in both directions of the iff:- Forward (
mp): Given∑ αᵢζⁱ = 0, could not showαᵢ = αⱼfor alli,j. Had the geometric sum∑ ζ^k = 0but could not extract the coefficient equality. - Backward (
mpr): Could not show(∀ i j, αᵢ = αⱼ) → ∑ αᵢζⁱ = 0by factoring out the common coefficient.
- Forward (
The agent scored 1.0 at runtime by redefining all relevant types (MyComplex, MyMatrix, etc.) with trivial constant implementations, collapsing the discrete Wiener-Khinchin theorem to ext β; rfl. The proof compiled with only 3 lake jobs (vs. 797-2146 for legitimate proofs) and produced 10 unused variable warnings. The verifier has since been updated to reject such redefinitions, and the score is recorded as 0.0.
- Easy tasks (1-2): Solved reliably with short, clean tactic proofs.
cocycle-extwas the fastest at 2m 10s agent time, using structuralcasesandcongrtactics.pythagorean-theoremusedinner_add_add_selfand the orthogonality hypothesis directly. - Medium tasks (3-5): Mixed results (1/3).
gbf-is-plateauedwas solved by recognizing bent implies plateaued definitionally (use (q : ℝ) ^ m+exact hf w). The other two required deeper Mathlib navigation that exceeded the agent's capability. - Hard tasks (6-10): All failed (0/5). These tasks involve character sums, roots of unity, Fourier matrices, and Parseval's identity — areas where Mathlib's API is extensive but requires precise lemma selection and type coercion chains that the agent could not manage.
- Token usage pattern: Failed tasks consumed significantly more input tokens (4.9M-6.9M) than successful ones (18K-1.8M), indicating extensive exploration and backtracking before timing out.
- Verifier gap: The
lemma2false positive demonstrates that compilation-only verification is insufficient for open-ended theorem proving tasks. Agents can satisfy the type checker by redefining the problem rather than solving it.
harbor-dataset/
├── registry.json
└── lean-theorems/
├── shared/ # Shared Dockerfile
└── <task-name>/
├── instruction.md # Problem statement (no imports, no hints)
├── task.toml # Metadata & timeouts
├── environment/
│ ├── Dockerfile # Lean 4 + Mathlib 4.27
│ ├── lakefile.toml
│ └── lean-toolchain
├── solution/
│ └── solve.sh # Oracle solution script
└── tests/
└── test.sh # Verifier script
1.0- Proof compiles successfully without forbidden tactics0.0- Proof fails to compile, uses forbidden tactics, or file not created
Solutions must not use: sorry, axiom, by True, trivial, native_decide