Fix D1 spec-test gaps in 3 tasks by foxie-huang · Pull Request #219 · QF-Bench/QuantitativeFinance-Bench

foxie-huang · 2026-05-06T02:54:45Z

Summary

Three tasks had confirmed D1 (Output Schema / Spec Compliance) issues where the instruction and the verifier disagreed in ways that could cause a correctly-implemented agent to fail the benchmark.

dupire-local-vol: Spec said forward variance "must be positive everywhere", but the verifier passed as long as ≥40% of values were positive. Root cause: the oracle always writes forward_variance = 0 for the last expiry (no next tenor to difference against), so the verifier must exclude that boundary row. New test excludes the last row and requires ≥90% of interior values to be positive, consistent with the updated spec language.
mc-greek-surface-1: Spec said LR Gamma "requires the second derivative of the log-likelihood" but gave no formula. Verifier asserts a non-null value within 10% of BS Gamma. Added the explicit score-function estimator so agents know what to compute: Γ_LR = e^{-rT} · E[payoff · (Z²−1−Zσ√T) / (S₀σ√T)²].
fx-forward-cross-rate: Spec gave only conceptual guidance ("think about which side a dealer would trade") while the verifier asserts specific cross-rate mid values to 5×10⁻⁴ tolerance. Added the explicit bid×bid / ask×ask triangulation formula with a GBP/JPY worked example.
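For the mc-greek-surface-1 item, the score-function estimator Γ_LR = e^{-rT} · E[payoff · (Z²−1−Zσ√T) / (S₀σ√T)²] can be sketched in a few lines. This is a minimal illustration, not the task's oracle: it assumes a European call under geometric Brownian motion, and the function names, parameter values, path count, and seed are all hypothetical.

```python
import math
import random


def bs_gamma(s0, k, r, sigma, t):
    """Closed-form Black-Scholes Gamma, used only as a reference value."""
    d1 = (math.log(s0 / k) + (r + 0.5 * sigma**2) * t) / (sigma * math.sqrt(t))
    pdf = math.exp(-0.5 * d1**2) / math.sqrt(2.0 * math.pi)
    return pdf / (s0 * sigma * math.sqrt(t))


def lr_gamma_call(s0, k, r, sigma, t, n_paths=400_000, seed=0):
    """Likelihood-ratio Gamma for a European call (assumed payoff) under GBM."""
    rng = random.Random(seed)
    vol_sqrt_t = sigma * math.sqrt(t)
    drift = (r - 0.5 * sigma**2) * t
    denom = (s0 * vol_sqrt_t) ** 2
    total = 0.0
    for _ in range(n_paths):
        z = rng.gauss(0.0, 1.0)
        s_t = s0 * math.exp(drift + vol_sqrt_t * z)
        payoff = max(s_t - k, 0.0)
        # Score-function weight from the PR: (Z^2 - 1 - Z*sigma*sqrt(T)) / (S0*sigma*sqrt(T))^2
        total += payoff * (z * z - 1.0 - z * vol_sqrt_t) / denom
    return math.exp(-r * t) * total / n_paths


# Hypothetical at-the-money example: S0 = K = 100, r = 5%, sigma = 20%, T = 1
estimate = lr_gamma_call(100.0, 100.0, 0.05, 0.2, 1.0)
reference = bs_gamma(100.0, 100.0, 0.05, 0.2, 1.0)
relative_error = abs(estimate - reference) / reference
```

The LR weight has heavy tails, so the estimator is noisy; the path count here is chosen only so that the estimate sits comfortably inside a 10% band around the Black-Scholes value.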

Test plan

Run oracle on dupire-local-vol — confirm all tests still pass with new 90% threshold
Run oracle on mc-greek-surface-1 — confirm LR Gamma test passes
Run oracle on fx-forward-cross-rate — confirm cross-rate tests pass
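The dupire-local-vol check described in the summary (drop the last expiry, require ≥90% of interior forward variances to be positive) might look roughly like the sketch below. It assumes a single list of values ordered by expiry and the usual total-variance differencing for forward variance; the function names and the sample inputs are hypothetical and are not taken from the task's verifier.

```python
def forward_variances(expiries, implied_vols):
    """Forward variance between consecutive expiries (assumed definition).

    Uses (w[i+1] - w[i]) / (T[i+1] - T[i]) with total variance w = vol^2 * T.
    The last expiry has no next tenor, so it is written as 0.0, mirroring
    the oracle behaviour described in the PR.
    """
    total_var = [v * v * t for v, t in zip(implied_vols, expiries)]
    out = []
    for i in range(len(expiries) - 1):
        out.append((total_var[i + 1] - total_var[i]) / (expiries[i + 1] - expiries[i]))
    out.append(0.0)  # boundary row: nothing to difference against
    return out


def interior_positive_fraction(values):
    """Share of strictly positive values, excluding the final (boundary) row."""
    interior = values[:-1]
    if not interior:
        raise ValueError("need at least two expiries")
    return sum(1 for v in interior if v > 0) / len(interior)


def passes_forward_variance_check(values, threshold=0.90):
    """True when at least `threshold` of interior values are positive."""
    return interior_positive_fraction(values) >= threshold


# Hypothetical flat 20% vol surface slice at three expiries
fv = forward_variances([0.5, 1.0, 2.0], [0.2, 0.2, 0.2])
ok = passes_forward_variance_check(fv)
```

Excluding the boundary row before computing the fraction is what lets the threshold be tightened without the oracle's trailing zero counting as a failure.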

🤖 Generated with Claude Code

…ward-cross-rate

dupire-local-vol: The spec said forward variance "must be positive everywhere" but the test accepted ≥40% positive, a direct contradiction. The oracle always stores forward_variance=0 for the last expiry (no next tenor to difference against), so the test should exclude that row. New test skips the last row and requires ≥90% of interior forward variances to be positive, matching the corrected spec language.

mc-greek-surface-1: The spec said LR Gamma "requires the second derivative of the log-likelihood" but gave no formula, while the test asserts a non-null value within 10% of BS Gamma. Added the explicit score-function estimator: Γ_LR = e^{-rT} · E[payoff · (Z²−1−Zσ√T) / (S₀σ√T)²] so agents know exactly what to compute.

fx-forward-cross-rate: The spec gave only conceptual guidance ("think about which side a dealer would trade") while the test asserts specific cross-rate mid values to 5e-4 tolerance. Added the explicit triangulation formula (bid×bid / ask×ask for each USD leg) with a GBP/JPY example so agents can implement it without guessing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
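The bid×bid / ask×ask triangulation mentioned for fx-forward-cross-rate can be illustrated as follows. This is a sketch under stated assumptions: GBP/USD is quoted as USD per GBP and USD/JPY as JPY per USD, so the two legs multiply. The quote values and the function name are made up for illustration and do not come from the task data.

```python
def cross_rate(base_usd, usd_quote):
    """Triangulate a cross rate through USD from two (bid, ask) quotes.

    Assumes base_usd is quoted as USD per unit of base currency and
    usd_quote as quote currency per USD, so like sides multiply:
    bid x bid for the cross bid, ask x ask for the cross ask.
    """
    bid = base_usd[0] * usd_quote[0]
    ask = base_usd[1] * usd_quote[1]
    mid = 0.5 * (bid + ask)
    return bid, ask, mid


# Hypothetical quotes, for illustration only
gbp_usd = (1.2500, 1.2504)   # USD per GBP
usd_jpy = (150.00, 150.02)   # JPY per USD

gbp_jpy_bid, gbp_jpy_ask, gbp_jpy_mid = cross_rate(gbp_usd, usd_jpy)
```

When both legs are quoted in the same direction against USD (for example EUR/USD and GBP/USD), the legs divide instead and opposite sides pair up (bid ÷ ask for the cross bid, ask ÷ bid for the cross ask), so the quoting convention has to be settled before applying either rule.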

PQCat approved these changes May 6, 2026

View reviewed changes

labubububula78-poop approved these changes May 6, 2026

View reviewed changes

foxie-huang merged commit fcb39e7 into main May 6, 2026
2 of 6 checks passed

foxie-huang merged 1 commit into main from fix/d1-spec-test-alignment

3 participants
