feat(contracts): add image_qa assertion to the outcome contract DSL#1445
Conversation
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
CI fix for #1445: parent #1441 introduces the image_qa MCP tool and the contract DSL adds `kind: 'image_qa'` to the Assertion union. The narrowing switch in src/tools/oc-assert.ts:expectedActualFor was missed in #1445, causing TS2366 (function lacks ending return). Mirror the existing branches: surface `question` and `expected_pattern` as the expected slice, and pass through the evaluator's `details` as the actual slice. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI fix for #1445: parent #1441 introduces the image_qa MCP tool and the contract DSL adds `kind: 'image_qa'` to the Assertion union. The narrowing switch in src/tools/oc-assert.ts:expectedActualFor was missed in #1445, causing TS2366 (function lacks ending return). Mirror the existing branches: surface `question` and `expected_pattern` as the expected slice, and pass through the evaluator's `details` as the actual slice. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
c96aaa6 to
fb7546d
Compare
Goal-review verdict — #1445Intent vs SSOT (#1359)
Code-level review
CI fixes applied
StatusMerge-ready pending parent #1441 landing. No further changes expected. |
New `image_qa` leaf assertion: `{ kind: 'image_qa', question,
expected_pattern }`. Evaluates by asking the host LLM a question about
the most recent screenshot through the optional EvalContext
`imageQaSample` hook, then matching the answer against
`expected_pattern` (JS RegExp, ReDoS-guarded by the existing
validateRegexPattern).
When the runtime does not wire `imageQaSample` OR when the host
returns `unsupported_by_host`, the assertion is inconclusive with
`passed: false`. OpenChrome never falls back to a server-side LLM,
matching the SSOT #1359 boundary established by the image_qa MCP tool
in Part 1.
Part 2 of #1432.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI fix for #1445: parent #1441 introduces the image_qa MCP tool and the contract DSL adds `kind: 'image_qa'` to the Assertion union. The narrowing switch in src/tools/oc-assert.ts:expectedActualFor was missed in #1445, causing TS2366 (function lacks ending return). Mirror the existing branches: surface `question` and `expected_pattern` as the expected slice, and pass through the evaluator's `details` as the actual slice. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fb7546d to
7a20a21
Compare
Completes the runtime wire-up promised in the original #1432-B brief that was deferred from PR #1445. The DSL clause and evaluator landed in Part 2, but EvalContext.imageQaSample was never populated, so every image_qa contract evaluated to inconclusive in production. Changes: - Extract the sampling forwarding from the image_qa tool handler into a reusable `runImageQaSampling(ctx, params)` export. - oc_assert.buildEvalContext now accepts the calling ToolContext and injects an imageQaSample closure that delegates to runImageQaSampling, threading the host's sampling capability + requestClient bridge straight through. - oc_assert handler signature accepts the optional context param. - The image_qa evaluator now stamps `details.error` (in addition to `details.reason`) on infra-fault paths so oc_assert's isInconclusive check translates them to verdict='inconclusive' instead of 'fail'. Runtime wiring problems are not contract failures. SSOT (#1359) alignment: no server-side LLM ever — the closure only forwards via sampling/createMessage, and falls back to unsupported_by_host when the host lacks the capability. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Completes the runtime wire-up promised in the original #1432-B brief that was deferred from PR #1445. The DSL clause and evaluator landed in Part 2, but EvalContext.imageQaSample was never populated, so every image_qa contract evaluated to inconclusive in production. Changes: - Extract the sampling forwarding from the image_qa tool handler into a reusable `runImageQaSampling(ctx, params)` export. - oc_assert.buildEvalContext now accepts the calling ToolContext and injects an imageQaSample closure that delegates to runImageQaSampling, threading the host's sampling capability + requestClient bridge straight through. - oc_assert handler signature accepts the optional context param. - The image_qa evaluator now stamps `details.error` (in addition to `details.reason`) on infra-fault paths so oc_assert's isInconclusive check translates them to verdict='inconclusive' instead of 'fail'. Runtime wiring problems are not contract failures. SSOT (#1359) alignment: no server-side LLM ever — the closure only forwards via sampling/createMessage, and falls back to unsupported_by_host when the host lacks the capability. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Analysis & merge summary (automated review)Intent / direction. Part 2 of #1432: adds an #1359 (SSOT) alignment — consistent. No server-side LLM. The evaluator only calls the host-provided hook; when the hook is absent / returns Note on the diff. Against current Review findings (both addressed in the follow-up #1452, by design):
Verification. Merged to |
Review analysis — merge-readinessIntent & direction. Part 2 of #1432: adds an #1359 alignment — PASS. No path where OpenChrome itself invokes a model: the evaluator returns inconclusive ( Findings & fixes applied this pass:
Exhaustiveness checked: Tests: Verdict: MERGE-READY once CI is green. Targets |
) Activates the image_qa contract assertion (added in #1445): buildEvalContext now injects an imageQaSample closure that forwards a screenshot+question to the host via MCP sampling/createMessage (runImageQaSampling), and the evaluator stamps details.error on infra-fault paths so verdicts translate to 'inconclusive' (not 'fail'). No server-side LLM; deterministic inconclusive fallback when sampling is absent (SSOT #1359 P2/P4). Review fixes folded in: (1) swapped raw new RegExp for compileSafeRegex on the now-reachable regex path (defense-in-depth ReDoS guard, resolves the #1445 P1); (2) guard non-text sampling content blocks -> inconclusive instead of a vacuous empty-string regex pass, with a regression test. Rebased onto develop to contain only the runtime-hook commit; retargeted base to develop. build + 43 image_qa/oc-assert tests pass. CI: ubuntu x3, windows x3, agent-success-mock green; macOS jobs were queue-delayed on GitHub's runners (platform-neutral change).
…(follow-up to #1445) (#1453) * docs(contracts): document image_qa kind and list it in oc_assert schema The image_qa leaf was added to the validator/evaluator but the host-facing surfaces still enumerated the old kind set: docs/contracts.md omitted it entirely and oc_assert's `contract` inputSchema description left it out of the kind list, so hosts had no portable way to discover the assertion. Add a DSL reference section (sampling-backed, inconclusive without a host hook per #1359) and add image_qa to the schema description. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(contracts): route image_qa evaluator regex through compileSafeRegex The evaluator recompiled expected_pattern with a bare `new RegExp`, the only pattern-matching evaluator in src/contracts not routed through the safe-regex guard. A pattern that reached evaluation without validation (e.g. a future persistence/cache path) would compile unguarded on the hot path. Use compileSafeRegex for defense-in-depth and consistency, and add a validator test for a missing expected_pattern. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Stacked on #1441 (Part 1).
image_qaleaf assertion in the contract DSL:{ question, expected_pattern }.EvalContext.imageQaSamplehook so the runtime can route to theimage_qaMCP tool (and hence to host sampling). Absent hook orunsupported_by_hostreply → inconclusive (passed:false).evaluate()(1).SSOT (#1359) alignment
validateRegexPatternat author time andcompileSafeRegexat evaluation time.Review-pass changes
image_qakind indocs/contracts.mdand added it to theoc_assertcontractschema description (hosts can now discover it).compileSafeRegexfor consistency/defense-in-depth.expected_pattern.Part 2 of #1432. Depends on #1441.
Co-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com