Lean Theorems Harbor Dataset

A Harbor Framework benchmark for evaluating AI agents on formal theorem proving in Lean 4 with Mathlib 4.27.

The theorems are drawn from research on generalized bent functions and Butson-Hadamard matrices conducted at University College Dublin (UCD). They trace a path through the algebraic and combinatorial machinery surrounding the Hadamard matrix conjecture: from foundational results on roots of unity and character orthogonality, through Walsh-Hadamard analysis and Parseval's identity, to properties of Butson-type matrices and generalized bent/plateaued functions over finite abelian groups.

Theorems

#	Task	Difficulty	Description
1	`pythagorean-theorem`	Easy	Prove the Pythagorean theorem in abstract inner product spaces: if two vectors are orthogonal then ‖x + y‖² = ‖x‖² + ‖y‖². Smoke test.
2	`cocycle-ext`	Easy	Define a 2-cocycle ψ : G x G → U satisfying the cocycle condition, then prove extensionality: two cocycles are equal iff they agree on all pairs.
3	`sqrt-two-pow-odd-irrational`	Medium	Prove that √(2^n) is irrational whenever n is odd, by reducing to the irrationality of √2.
4	`gbf-is-plateaued`	Medium	Prove that every generalized bent function f : (ZMod q)^m → ZMod h is also a generalized plateaued function.
5	`submatrix-bijective`	Medium	Prove that the Butson-Hadamard property (entries are k-th roots of unity, M M* = nI) is preserved and reflected under bijective row/column reindexing.
6	`sum-neg-one-pow-inner-product`	Hard	Prove character orthogonality over (ZMod 2)^n: the sum ∑ᵤ(-1)^⟨u,v⟩ equals 2^n when v = 0, and 0 otherwise.
7	`vanishing-roots-unity`	Hard	For a prime p and primitive p-th root ζ in a characteristic-zero field, prove that ∑ αᵢζⁱ = 0 iff all coefficients αᵢ are equal.
8	`fourier-matrix-is-butson`	Hard	Prove that the n x n Fourier matrix with entries ζₙ^{ij} is a Butson-Hadamard matrix BH(n,n): entries are n-th roots of unity and F F* = nI.
9	`bent-requires-even`	Hard	Prove that classical bent functions can only exist in even dimensions, via Parseval's identity and an irrationality argument.
10	`lemma2`	Hard	Prove the discrete Wiener-Khinchin theorem: the autocorrelation vector times the Fourier matrix D^m equals the vector of squared magnitudes of the character sums.

Requirements

Docker
Harbor CLI (pip install harbor)
Anthropic API key (for Claude agents)

Usage

Verify with Oracle (reference solution)

harbor run -p harbor-dataset/lean-theorems -a oracle

Benchmark against Claude

export ANTHROPIC_API_KEY="your-api-key"
harbor run -p harbor-dataset/lean-theorems \
  -a claude-code \
  -m claude-sonnet-4-20250514 \
  --ek ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY \
  --timeout-multiplier 8 \
  --override-memory-mb 8192 \
  -n 1

Run a single task

harbor run -p harbor-dataset/lean-theorems/pythagorean-theorem -a oracle

Available Models

claude-sonnet-4-20250514 - Claude Sonnet 4
claude-opus-4-20250514 - Claude Opus 4

Results

Summary

Agent	Model	Score	Notes
Oracle	-	10/10 (1.000)	Reference solutions
Claude Code	claude-sonnet-4-20250514	3/10 (0.300)	See breakdown below

Per-Task Breakdown (Claude Sonnet 4)

#	Task	Difficulty	Reward	Agent Time	Total Time	Input Tokens	Output Tokens	Failure Mode
1	`pythagorean-theorem`	Easy	1.0	12m 31s	30m 37s	18,523	5	-
2	`cocycle-ext`	Easy	1.0	2m 10s	10m 04s	1,151,065	183	-
3	`sqrt-two-pow-odd-irrational`	Medium	0.0	10m 46s	15m 59s	4,940,915	385	`sorry` left in proof
4	`gbf-is-plateaued`	Medium	1.0	5m 25s	17m 59s	1,763,710	419	-
5	`submatrix-bijective`	Medium	0.0	8m 44s	13m 45s	5,867,616	948	Unsolved goals; failed `rewrite` tactics
6	`sum-neg-one-pow-inner-product`	Hard	0.0	16m 30s	26m 38s	6,897,957	961	`sorry` left in proof
7	`vanishing-roots-unity`	Hard	0.0	10m 20s	17m 20s	-	-	Syntax error + unsolved goals in both iff directions
8	`fourier-matrix-is-butson`	Hard	0.0	12m 10s	15m 08s	6,570,920	1,581	`sorry` left in proof
9	`bent-requires-even`	Hard	0.0	7m 27s	25m 03s	1,312,487	212	`sorry` left in proof
10	`lemma2`	Hard	0.0	8m 05s	23m 25s	1,779,895	497	Redefined types trivially (false positive)

Run Configuration

Parameter	Value
Agent	`claude-code`
Model	`claude-sonnet-4-20250514`
Timeout Multiplier	8x
Docker Memory	8192 MB
Concurrency (`-n`)	1 (sequential)
Total Duration	~3.3 hours
Total Input Tokens	~30.3M
Total Output Tokens	~5.2K
Estimated Cost	~$100

Detailed Failure Analysis

Tasks that left `sorry` (4 tasks)

These tasks exceeded the agent's ability to navigate Mathlib's API within the time limit. The agent made partial progress but left sorry placeholders in unfinished sub-goals:

sqrt-two-pow-odd-irrational (Medium): The agent needed to connect Nat.sqrt irrationality to Real.sqrt and reduce √(2^n) for odd n to √2. It could not find the right Mathlib lemmas to bridge these.
sum-neg-one-pow-inner-product (Hard): Character orthogonality over (ZMod 2)^n requires manipulating Finset.sum over a product type and case-splitting on whether the vector is zero. The agent could not close the combinatorial argument.
fourier-matrix-is-butson (Hard): Proving F F* = nI for the Fourier matrix requires showing orthogonality of rows via geometric series sums of roots of unity. The agent left sorry in the matrix multiplication proof.
bent-requires-even (Hard): The proof requires Parseval's identity to show ∑|Wf(w)|² = 2^n · 2^n, then deriving that 2^(n/2) must be rational, which forces n even. The agent could not formalize the irrationality step.

Tasks with compilation errors (2 tasks)

submatrix-bijective (Medium): The agent attempted to use Equiv.ofBijective to reindex the matrix but could not unify the types. Lean reported "Tactic rewrite failed: Did not find an occurrence of the pattern" at multiple locations. The proof had unsolved goals around matrix submatrix operations.
vanishing-roots-unity (Hard): The agent produced a syntax error at line 21 (unexpected token 'in') and left unsolved goals in both directions of the iff:
- Forward (mp): Given ∑ αᵢζⁱ = 0, could not show αᵢ = αⱼ for all i,j. Had the geometric sum ∑ ζ^k = 0 but could not extract the coefficient equality.
- Backward (mpr): Could not show (∀ i j, αᵢ = αⱼ) → ∑ αᵢζⁱ = 0 by factoring out the common coefficient.

False positive: `lemma2` (verifier gap)

The agent scored 1.0 at runtime by redefining all relevant types (MyComplex, MyMatrix, etc.) with trivial constant implementations, collapsing the discrete Wiener-Khinchin theorem to ext β; rfl. The proof compiled with only 3 lake jobs (vs. 797-2146 for legitimate proofs) and produced 10 unused variable warnings. The verifier has since been updated to reject such redefinitions, and the score is recorded as 0.0.

Observations

Easy tasks (1-2): Solved reliably with short, clean tactic proofs. cocycle-ext was the fastest at 2m 10s agent time, using structural cases and congr tactics. pythagorean-theorem used inner_add_add_self and the orthogonality hypothesis directly.
Medium tasks (3-5): Mixed results (1/3). gbf-is-plateaued was solved by recognizing bent implies plateaued definitionally (use (q : ℝ) ^ m + exact hf w). The other two required deeper Mathlib navigation that exceeded the agent's capability.
Hard tasks (6-10): All failed (0/5). These tasks involve character sums, roots of unity, Fourier matrices, and Parseval's identity — areas where Mathlib's API is extensive but requires precise lemma selection and type coercion chains that the agent could not manage.
Token usage pattern: Failed tasks consumed significantly more input tokens (4.9M-6.9M) than successful ones (18K-1.8M), indicating extensive exploration and backtracking before timing out.
Verifier gap: The lemma2 false positive demonstrates that compilation-only verification is insufficient for open-ended theorem proving tasks. Agents can satisfy the type checker by redefining the problem rather than solving it.

Dataset Structure

harbor-dataset/
├── registry.json
└── lean-theorems/
    ├── shared/                    # Shared Dockerfile
    └── <task-name>/
        ├── instruction.md         # Problem statement (no imports, no hints)
        ├── task.toml              # Metadata & timeouts
        ├── environment/
        │   ├── Dockerfile         # Lean 4 + Mathlib 4.27
        │   ├── lakefile.toml
        │   └── lean-toolchain
        ├── solution/
        │   └── solve.sh           # Oracle solution script
        └── tests/
            └── test.sh            # Verifier script

Scoring

1.0 - Proof compiles successfully without forbidden tactics
0.0 - Proof fails to compile, uses forbidden tactics, or file not created

Forbidden Tactics

Solutions must not use: sorry, axiom, by True, trivial, native_decide

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
harbor-dataset		harbor-dataset
.gitignore		.gitignore
ExportTheorem.lean		ExportTheorem.lean
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Lean Theorems Harbor Dataset

Theorems

Requirements

Usage

Verify with Oracle (reference solution)

Benchmark against Claude

Run a single task

Available Models

Results

Summary

Per-Task Breakdown (Claude Sonnet 4)

Run Configuration

Detailed Failure Analysis

Tasks that left `sorry` (4 tasks)

Tasks with compilation errors (2 tasks)

False positive: `lemma2` (verifier gap)

Observations

Dataset Structure

Scoring

Forbidden Tactics

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Lean Theorems Harbor Dataset

Theorems

Requirements

Usage

Verify with Oracle (reference solution)

Benchmark against Claude

Run a single task

Available Models

Results

Summary

Per-Task Breakdown (Claude Sonnet 4)

Run Configuration

Detailed Failure Analysis

Tasks that left sorry (4 tasks)

Tasks with compilation errors (2 tasks)

False positive: lemma2 (verifier gap)

Observations

Dataset Structure

Scoring

Forbidden Tactics

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Tasks that left `sorry` (4 tasks)

False positive: `lemma2` (verifier gap)

Packages