Skip to content

Latinum-Agentic-Commerce/10thm-dataset-sample

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 

Repository files navigation

Lean Theorems Harbor Dataset

A Harbor Framework benchmark for evaluating AI agents on formal theorem proving in Lean 4 with Mathlib 4.27.

The theorems are drawn from research on generalized bent functions and Butson-Hadamard matrices conducted at University College Dublin (UCD). They trace a path through the algebraic and combinatorial machinery surrounding the Hadamard matrix conjecture: from foundational results on roots of unity and character orthogonality, through Walsh-Hadamard analysis and Parseval's identity, to properties of Butson-type matrices and generalized bent/plateaued functions over finite abelian groups.

Theorems

# Task Difficulty Description
1 pythagorean-theorem Easy Prove the Pythagorean theorem in abstract inner product spaces: if two vectors are orthogonal then ‖x + y‖² = ‖x‖² + ‖y‖². Smoke test.
2 cocycle-ext Easy Define a 2-cocycle ψ : G x G → U satisfying the cocycle condition, then prove extensionality: two cocycles are equal iff they agree on all pairs.
3 sqrt-two-pow-odd-irrational Medium Prove that √(2^n) is irrational whenever n is odd, by reducing to the irrationality of √2.
4 gbf-is-plateaued Medium Prove that every generalized bent function f : (ZMod q)^m → ZMod h is also a generalized plateaued function.
5 submatrix-bijective Medium Prove that the Butson-Hadamard property (entries are k-th roots of unity, M M* = nI) is preserved and reflected under bijective row/column reindexing.
6 sum-neg-one-pow-inner-product Hard Prove character orthogonality over (ZMod 2)^n: the sum ∑ᵤ(-1)^⟨u,v⟩ equals 2^n when v = 0, and 0 otherwise.
7 vanishing-roots-unity Hard For a prime p and primitive p-th root ζ in a characteristic-zero field, prove that ∑ αᵢζⁱ = 0 iff all coefficients αᵢ are equal.
8 fourier-matrix-is-butson Hard Prove that the n x n Fourier matrix with entries ζₙ^{ij} is a Butson-Hadamard matrix BH(n,n): entries are n-th roots of unity and F F* = nI.
9 bent-requires-even Hard Prove that classical bent functions can only exist in even dimensions, via Parseval's identity and an irrationality argument.
10 lemma2 Hard Prove the discrete Wiener-Khinchin theorem: the autocorrelation vector times the Fourier matrix D^m equals the vector of squared magnitudes of the character sums.

Requirements

  • Docker
  • Harbor CLI (pip install harbor)
  • Anthropic API key (for Claude agents)

Usage

Verify with Oracle (reference solution)

harbor run -p harbor-dataset/lean-theorems -a oracle

Benchmark against Claude

export ANTHROPIC_API_KEY="your-api-key"
harbor run -p harbor-dataset/lean-theorems \
  -a claude-code \
  -m claude-sonnet-4-20250514 \
  --ek ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY \
  --timeout-multiplier 8 \
  --override-memory-mb 8192 \
  -n 1

Run a single task

harbor run -p harbor-dataset/lean-theorems/pythagorean-theorem -a oracle

Available Models

  • claude-sonnet-4-20250514 - Claude Sonnet 4
  • claude-opus-4-20250514 - Claude Opus 4

Results

Summary

Agent Model Score Notes
Oracle - 10/10 (1.000) Reference solutions
Claude Code claude-sonnet-4-20250514 3/10 (0.300) See breakdown below

Per-Task Breakdown (Claude Sonnet 4)

# Task Difficulty Reward Agent Time Total Time Input Tokens Output Tokens Failure Mode
1 pythagorean-theorem Easy 1.0 12m 31s 30m 37s 18,523 5 -
2 cocycle-ext Easy 1.0 2m 10s 10m 04s 1,151,065 183 -
3 sqrt-two-pow-odd-irrational Medium 0.0 10m 46s 15m 59s 4,940,915 385 sorry left in proof
4 gbf-is-plateaued Medium 1.0 5m 25s 17m 59s 1,763,710 419 -
5 submatrix-bijective Medium 0.0 8m 44s 13m 45s 5,867,616 948 Unsolved goals; failed rewrite tactics
6 sum-neg-one-pow-inner-product Hard 0.0 16m 30s 26m 38s 6,897,957 961 sorry left in proof
7 vanishing-roots-unity Hard 0.0 10m 20s 17m 20s - - Syntax error + unsolved goals in both iff directions
8 fourier-matrix-is-butson Hard 0.0 12m 10s 15m 08s 6,570,920 1,581 sorry left in proof
9 bent-requires-even Hard 0.0 7m 27s 25m 03s 1,312,487 212 sorry left in proof
10 lemma2 Hard 0.0 8m 05s 23m 25s 1,779,895 497 Redefined types trivially (false positive)

Run Configuration

Parameter Value
Agent claude-code
Model claude-sonnet-4-20250514
Timeout Multiplier 8x
Docker Memory 8192 MB
Concurrency (-n) 1 (sequential)
Total Duration ~3.3 hours
Total Input Tokens ~30.3M
Total Output Tokens ~5.2K
Estimated Cost ~$100

Detailed Failure Analysis

Tasks that left sorry (4 tasks)

These tasks exceeded the agent's ability to navigate Mathlib's API within the time limit. The agent made partial progress but left sorry placeholders in unfinished sub-goals:

  • sqrt-two-pow-odd-irrational (Medium): The agent needed to connect Nat.sqrt irrationality to Real.sqrt and reduce √(2^n) for odd n to √2. It could not find the right Mathlib lemmas to bridge these.
  • sum-neg-one-pow-inner-product (Hard): Character orthogonality over (ZMod 2)^n requires manipulating Finset.sum over a product type and case-splitting on whether the vector is zero. The agent could not close the combinatorial argument.
  • fourier-matrix-is-butson (Hard): Proving F F* = nI for the Fourier matrix requires showing orthogonality of rows via geometric series sums of roots of unity. The agent left sorry in the matrix multiplication proof.
  • bent-requires-even (Hard): The proof requires Parseval's identity to show ∑|Wf(w)|² = 2^n · 2^n, then deriving that 2^(n/2) must be rational, which forces n even. The agent could not formalize the irrationality step.

Tasks with compilation errors (2 tasks)

  • submatrix-bijective (Medium): The agent attempted to use Equiv.ofBijective to reindex the matrix but could not unify the types. Lean reported "Tactic rewrite failed: Did not find an occurrence of the pattern" at multiple locations. The proof had unsolved goals around matrix submatrix operations.
  • vanishing-roots-unity (Hard): The agent produced a syntax error at line 21 (unexpected token 'in') and left unsolved goals in both directions of the iff:
    • Forward (mp): Given ∑ αᵢζⁱ = 0, could not show αᵢ = αⱼ for all i,j. Had the geometric sum ∑ ζ^k = 0 but could not extract the coefficient equality.
    • Backward (mpr): Could not show (∀ i j, αᵢ = αⱼ) → ∑ αᵢζⁱ = 0 by factoring out the common coefficient.

False positive: lemma2 (verifier gap)

The agent scored 1.0 at runtime by redefining all relevant types (MyComplex, MyMatrix, etc.) with trivial constant implementations, collapsing the discrete Wiener-Khinchin theorem to ext β; rfl. The proof compiled with only 3 lake jobs (vs. 797-2146 for legitimate proofs) and produced 10 unused variable warnings. The verifier has since been updated to reject such redefinitions, and the score is recorded as 0.0.

Observations

  • Easy tasks (1-2): Solved reliably with short, clean tactic proofs. cocycle-ext was the fastest at 2m 10s agent time, using structural cases and congr tactics. pythagorean-theorem used inner_add_add_self and the orthogonality hypothesis directly.
  • Medium tasks (3-5): Mixed results (1/3). gbf-is-plateaued was solved by recognizing bent implies plateaued definitionally (use (q : ℝ) ^ m + exact hf w). The other two required deeper Mathlib navigation that exceeded the agent's capability.
  • Hard tasks (6-10): All failed (0/5). These tasks involve character sums, roots of unity, Fourier matrices, and Parseval's identity — areas where Mathlib's API is extensive but requires precise lemma selection and type coercion chains that the agent could not manage.
  • Token usage pattern: Failed tasks consumed significantly more input tokens (4.9M-6.9M) than successful ones (18K-1.8M), indicating extensive exploration and backtracking before timing out.
  • Verifier gap: The lemma2 false positive demonstrates that compilation-only verification is insufficient for open-ended theorem proving tasks. Agents can satisfy the type checker by redefining the problem rather than solving it.

Dataset Structure

harbor-dataset/
├── registry.json
└── lean-theorems/
    ├── shared/                    # Shared Dockerfile
    └── <task-name>/
        ├── instruction.md         # Problem statement (no imports, no hints)
        ├── task.toml              # Metadata & timeouts
        ├── environment/
        │   ├── Dockerfile         # Lean 4 + Mathlib 4.27
        │   ├── lakefile.toml
        │   └── lean-toolchain
        ├── solution/
        │   └── solve.sh           # Oracle solution script
        └── tests/
            └── test.sh            # Verifier script

Scoring

  • 1.0 - Proof compiles successfully without forbidden tactics
  • 0.0 - Proof fails to compile, uses forbidden tactics, or file not created

Forbidden Tactics

Solutions must not use: sorry, axiom, by True, trivial, native_decide

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors