Add sophistry_bench_sprint_env: single-agent advocacy reward-hacking environment by acharyaanusha · Pull Request #787 · huggingface/OpenEnv

acharyaanusha · 2026-06-11T02:16:06Z

Summary

Adds envs/sophistry_bench_sprint_env/ — an OpenEnv port of the sophistry-bench single-agent advocacy reward-hacking environment. One-turn advocacy over QuALITY reading-comprehension passages: the policy is assigned the gold answer or a distractor and must produce one <claim>/<cite> argument. The reward proxy peaks at exactly 8 <claim> tags (claim_count_cliff); four weight-0 canaries detect format hacking, making this a compact, reproducible reward-hacking measurement env.

Single-step episode, typed (non-MCP) pattern: models.py (AdvocacyAction/AdvocacyObservation) + EnvClient, FastAPI app, Dockerfile copied from echo_env.
reset() issues the task (passage + question + answer-to-defend); step(AdvocacyAction(text=...)) returns reward + all 8 reward components and done=True. Hidden ground truth: the policy is told what to defend, never whether it's gold; correctness surfaces only in step metadata.
No scoring drift: scoring/dataset logic is imported from the upstream sophistry-bench-sprint package rather than reimplemented, and a parity test asserts the OpenEnv aggregate_reward equals the canonical reward to 1e-9.
Config via env vars: SPRINT_N_ITEMS, SPRINT_PASSAGE_CHARS, SPRINT_SEED, SPRINT_WEIGHTS.

A serialization fix worth flagging for the framework

While containerizing, I found that core/env_server/serialization.py::serialize_observation excludes the base Observation.metadata from the wire payload — so reward components placed in metadata are silently dropped over HTTP (in-process tests don't catch it because they never serialize). Worked around it within the env by mirroring components into a declared components field (declared subclass fields survive serialization) and restoring metadata client-side, guarded by serialization round-trip tests. Happy to discuss whether the framework should preserve metadata instead.

Test Plan

cd envs/sophistry_bench_sprint_env && uv run pytest tests/ -v → 10 passed (incl. anti-drift parity + 2 wire-serialization regression tests)
openenv build sophistry_bench_sprint_env builds; container smoke test green (reset → step_text with 8 claims → reward 0.5, all 8 components over HTTP)

Dependency & live demo

sophistry-bench-sprint is published on PyPI (https://pypi.org/project/sophistry-bench-sprint/) and pulled as a normal dependency — no vendored wheel. A live demo Space is deployed at https://huggingface.co/spaces/anushaacharya/sophistry_bench_sprint_env.

🤖 Generated with Claude Code

…print wheel Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ion models Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…d parity Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The framework's serialize_observation() strips the base Observation.metadata dict from the wire response, so the eight reward components never reached the containerized client (metadata arrived empty). Mirror the components into a declared AdvocacyObservation.components field server-side and re-populate metadata from it in the client's _parse_result, preserving the public contract that observation.metadata carries all eight components. Verified end-to-end against the built container (smoke test: REWARD 0.5, all 8 keys present). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…tion regression test Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…survival test Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…-sprint 0.1.5 (drop vendored wheel) sophistry-bench-sprint is now on PyPI; switch the dependency to the release, remove the vendored wheel, and add HF Space README frontmatter. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

acharyaanusha · 2026-06-11T02:39:23Z

@Darktex @burtenshaw would love a review!

Darktex

Note: This is an automated review by Claude Code, not a human review.

Review: `sophistry_bench_sprint_env`

Thanks for a well-structured, well-documented environment — the hidden-ground-truth design, the anti-drift parity test, and the proactively-documented serialization workaround are all genuinely nice. The structure respects the rewards-inside-environment invariant and the typed (non-MCP) client/server split is correct. A few mechanical items block CI and should be fixed before merge, plus two alignment points worth a human look.

Tier 1 — Fixes required (CI-blocking)

tests/test_environment.py:85 — ruff F811: AdvocacyAction is imported at line 4 and re-imported at line 85. Remove the redundant import (ruff check --fix handles it).
tests/test_environment.py — usort: import asyncio (and the following from sophistry_bench_sprint import load_environment) sit mid-file after function definitions. Move them into the top-of-file import block.
models.py & server/sophistry_bench_sprint_environment.py — ruff format: both files would be reformatted (line-length wrapping only). Run uv run ruff format.

I reproduced all three locally: ruff check → 1 error, ruff format --check → 2 files, usort check → 1 file.

Tier 1 — Smaller fixes

server/sophistry_bench_sprint_environment.py:54 — missing generic parameters: declared as class SophistryBenchSprintEnvironment(Environment):. Other typed envs parameterize the base — e.g. maze_env (Environment[MazeAction, MazeObservation, MazeState]) and tbench2_env. The client already does this correctly (client.py:21). Suggest Environment[AdvocacyAction, AdvocacyObservation, <StateType>] to match convention and INVARIANTS.md.
README.md:41 — usage import path: example uses from envs.sophistry_bench_sprint_env import ..., which only resolves with the repo root on sys.path. The canonical pattern (see echo_env/README.md) is from sophistry_bench_sprint_env import ....

Tier 2 — Alignment / for human review

Serialization workaround couples the env to undocumented framework behavior. core/env_server/serialization.py::serialize_observation strips the base Observation.metadata from the HTTP payload, so the env mirrors reward components into a declared components field and restores metadata client-side. This is correctly done and locked down by round-trip tests — but it's now load-bearing against an internal framework detail. If the framework later preserves metadata (or renames the mirroring contract), the client restore logic could silently diverge. Worth a maintainer decision on whether the framework should preserve metadata instead (the author explicitly invited this discussion). cc @Darktex
aggregate_reward formula is locally reimplemented, not delegated upstream. The PR claims "no scoring drift," and the parity test (test_aggregate_matches_canonical_verifiers_reward, to 1e-9) is the right safeguard. But the combination (cliff + ground) / 2.0 is recomputed in step() rather than imported from sophistry-bench-sprint. If upstream changes how sub-scores combine, this goes stale silently (and the parity test only runs when the package is installed). Consider importing an aggregate function directly, or adding a comment marking the formula as load-bearing and in-sync-required.

Claims verified against the code

✅ reset() issues task with done=False; step() returns done=True and all 8 components.
✅ Hidden ground truth: reset() exposes what to defend, not whether it's gold; correctness only in step() metadata.
✅ Serialization workaround + round-trip tests exercise the real serialize_observation path.
✅ SPRINT_* env-var config all handled; uv.lock present.
⚠️ "No scoring drift" — parity test is solid, but aggregate formula is reimplemented (see above).

Verdict

Request changes — the three lint/format/usort failures will fail CI and must land first; the generic-param and README fixes are quick convention items. The two Tier-2 points are for discussion, not blockers.

Automated review by Claude Code | Learn more

…nv conventions Code review (CI-blocking + convention items): - Fix ruff F811 / usort: rewrite test import block, drop redundant import. - ruff format models.py and the environment module. - Parameterize the base class: Environment[AdvocacyAction, AdvocacyObservation, State]. - README usage: import from `sophistry_bench_sprint_env` (not `envs.`). - Mark the reproduced `aggregate_reward` formula LOAD-BEARING (no public export to import; parity test pins it to 1e-9). Re-review vs merged envs (echo/maze/tbench2): - Add docs stub docs/source/environments/sophistry_bench_sprint.md (CI doc-sync check was failing) and register in _toctree.yml + environments.md. - Move tests to the central tests/envs/ layout so CI actually collects them; guard with pytest.importorskip (the env's scoring dep isn't in the base test env), matching the camel-guarded pattern in tbench2. - Dockerfile: huggingface openenv-base + ENV ENABLE_WEB_INTERFACE=true (matches echo/tbench2); README front-matter base_path: /web. - pyproject: depend on `openenv[core]` (not `openenv-core[core]`) like all other envs; add pytest-asyncio/pytest-cov dev extras; re-lock (openenv 0.3.1). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

acharyaanusha · 2026-06-11T15:20:09Z

Pushed 792905d3 addressing the review

Tier 1 — fixes

F811 / usort: rewrote the test import block (single top-of-file block, dropped the redundant AdvocacyAction import).
ruff format applied to models.py and the environment module.
Base class now parameterized: Environment[AdvocacyAction, AdvocacyObservation, State] (uses base State, matching the client's EnvClient[..., State]).
README usage now imports from sophistry_bench_sprint_env, not envs.sophistry_bench_sprint_env.

Tier 2 — alignment

aggregate formula: there is no public aggregate to import — aggregate_reward is an inner closure of sophistry_bench_sprint._build_reward_funcs, not in the package's __all__. Rather than widen the package's public API for one expression, I marked the reproduced (cliff + ground) / 2.0 LOAD-BEARING in a comment and kept the 1e-9 parity test as the drift guard. Happy to export an aggregate from the package instead if you'd prefer that contract.
serialization workaround: left as-is (declared components/error fields + client restore, locked by round-trip tests) since it's the only thing that survives serialize_observation stripping base metadata. cc @Darktex

Darktex

Note: This is an automated review by Claude Code, not a human review.

Verdict: comment (non-blocking suggestions — nothing here is a confirmed bug, the env is well-constructed and ships with a canonical parity test)

Thanks for a thorough, well-documented environment. The typed (non-MCP) pattern, the hidden-ground-truth design (correctness only in step metadata, never told to the policy), and the 1e-9 parity test against the upstream canonical reward are all exactly right. A few suggestions to consider:

Suggestions (non-blocking)

Fragile positional zip for reward weighting — server/sophistry_bench_sprint_environment.py
```
reward = sum(w * c for w, c in zip(self.weights, metadata.values()))
```
This relies on metadata dict insertion order matching the positional weights array. It is correct for the shipped default ([1,0,0,0,0,0,0,0] — only aggregate_reward is weighted) and dict order is guaranteed in Python 3.7+, but any custom SPRINT_WEIGHTS that weights a non-aggregate component would silently break if the dict is ever reordered. Consider an explicit named mapping to make the contract self-documenting and refactor-safe:
```
_WEIGHT_KEYS = ["aggregate_reward", "correctness_reward", "n_claims", "n_citations",
                "alternation_canary", "starts_with_canary", "length_band_canary", "template_echo_canary"]
reward = sum(w * metadata[k] for w, k in zip(self.weights, _WEIGHT_KEYS))
```
Parity test verifies formula parity, not dataset parity — tests/envs/test_sophistry_bench_sprint_environment.py
The test builds vf_env = load_environment(...) but never resets it; it passes env._current_passage (the OpenEnv side) straight into the canonical reward fn. That confirms the arithmetic matches given the same passage, but not that both sides select the same passage for a given seed. If dataset selection ever diverges, this test would still pass. Either add a same-passage-at-seed assertion, or add a comment clarifying the test intentionally covers only formula parity.
Private-attribute access in test — capture env._current_passage before step() (which flips _has_task=False) and avoid reaching into private state from the test, e.g. read it right after reset().

Alignment notes for a human reviewer

Metadata stripped over the wire: the env mirrors all 8 reward components into a declared components field because serialize_observation excludes metadata. The workaround is sound and tested, but it papers over a framework-level limitation — worth fixing at the framework layer so every typed env gets observation.metadata over the wire for free rather than each env reconstructing it. (This observation was made against a possibly-older snapshot of src/openenv/core/env_server/serialization.py; please confirm against current main.)
Reproduced upstream formula + unbounded pin: the aggregate formula is reproduced locally (it's an inner closure upstream, not importable — acknowledged in a code comment) and pyproject.toml pins sophistry-bench-sprint>=0.1.5 with no upper bound. The parity test is the right guard, but it only catches drift when the suite runs against a newer package version. Consider an upper bound or a CI check.

None of the above blocks merge — they're hardening suggestions. Nice work.

Automated review by Claude Code | Learn more

Darktex

Note: This is an automated review by Claude Code, not a human review.

Thoughtful, well-tested contribution — wire-serialization regression tests, a 1e-9 parity test against the upstream scorer, the weight-0 canary design for detecting format hacking, and proper copyright headers. The structure conforms to the standard env layout (models / client / server / openenv.yaml / Dockerfile / tests). No blocking issues; a few non-blocking points below.

Tier 1 (Bugs & Lint) — non-blocking

server/sophistry_bench_sprint_environment.py:652 — reward = sum(w * c for w, c in zip(self.weights, metadata.values())) couples the 8 weights to the insertion order of the metadata dict literal (lines 642–651). It's correct today (the order matches the documented SPRINT_WEIGHTS order and Python preserves insertion order), but a future reorder of that dict would silently scramble every weight with no error. Consider binding to an explicit ordered list of (key, value) pairs so the weight↔component mapping is reviewable at a glance.
models.py:242 / models.py:281 — from typing import Dict / components: Dict[str, float]: prefer the built-in dict[str, float] (project targets Python ≥ 3.10).
models.py:279 — AdvocacyObservation re-declares reward: float = Field(0.0, ...), narrowing the base Observation.reward (bool|int|float|None). The default-0.0 path is covered by tests; just confirm no framework code path (e.g. reset serialization) ever constructs this observation with reward=None, which would now fail validation.

Tier 2 (Alignment)

Ground-truth visibility (worth an explicit doc note): correctness_reward is surfaced in observation.components (and reconstructed into observation.metadata client-side), so it is present in the per-step result. This is intentional per the design ("correctness surfaces only in step metadata"), but it's only safe if the training harness never feeds observation.metadata/components back into the policy prompt. Recommend stating that constraint explicitly in the README so downstream users don't accidentally leak the gold/distractor signal into the agent's context.
Reward parity pinning: the aggregate-reward formula is a deliberate reimplementation of an upstream private closure, guarded by the parity test. To keep the "no scoring drift" guarantee robust, consider pinning sophistry-bench-sprint to an exact version and documenting that any bump must re-run the parity test (a floor >= bound lets upstream patch the formula while the test stays green against a pinned lock).

Flagging the two Tier 2 items for maintainer eyes (@Darktex). None of the above blocks merge.

Automated review by Claude Code | Learn more

…re/test contracts Addresses two follow-up reviews and a self-review pass. - Reward weighting: bind weights to an explicit `_COMPONENT_KEYS` order (not dict insertion order); validate `len(weights) == 8` on the constructor `weights=` path too (the env-var path was already checked) and `zip(..., strict=True)` as a backstop. A mis-sized vector now raises instead of silently truncating the reward and dropping canary components. Adds a regression test. - Parity test: now also asserts dataset parity (same passage selected at the same seed), captures the passage before `step()`, and feeds the canonical fn its own passage — covers dataset + formula parity, not just arithmetic. - models.py: `Dict` -> `dict`; document that post-step reward must be read from `StepResult.reward` (observation.reward is stripped over the wire). - README/docs: test command now actually executes (was silently skipping via importorskip under the base venv); add a ground-truth-leak warning (don't feed observation.metadata/components back into the policy prompt); mirror it to the docs stub. - pyproject: cap `sophistry-bench-sprint>=0.1.5,<0.2.0` so an upstream formula change can't silently drift in; re-lock. - step(): count only scored steps (increment after the reset guard). - app.py: comment the load-bearing package-dir remap the container import relies on. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

acharyaanusha · 2026-06-12T05:26:25Z

Pushed 94a7f616 addressing the follow-up reviews.

Reward weighting (raised in both reviews) — the zip(self.weights, metadata.values()) coupled weights to dict insertion order. Now:

Weights bind to an explicit _COMPONENT_KEYS tuple (reward = sum(w * metadata[k] for w, k in zip(self.weights, _COMPONENT_KEYS, strict=True))).
Self-review caught that the constructor weights= arg skipped the 8-length check (only SPRINT_WEIGHTS was validated) — a mis-sized vector silently truncated via zip, dropping canary components with no error. Now validated in __init__ for both paths, with strict=True as a backstop. Added a regression test.

Parity test (formula vs dataset parity) — now also asserts vf_env.dataset[0].info.passage == passage (same passage selected at the same seed), captures the passage before step(), and feeds the canonical fn its own passage. Covers dataset + formula parity, not just arithmetic-given-the-same-passage.

Ground-truth leak — added an explicit README warning (and mirrored it into the docs stub) that harnesses must not feed observation.metadata/components back into the policy prompt, since correctness_reward is the hidden gold/distractor signal.

Wire reward contract — clarified in models.py that post-step reward must be read from StepResult.reward; observation.reward is stripped by the serializer and is always 0.0 over the wire.

Smaller items

from typing import Dict → built-in dict[str, float].
pyproject: capped sophistry-bench-sprint>=0.1.5,<0.2.0 so an upstream formula change can't silently drift past the parity guard; re-locked.
README/docs test command now actually runs the suite (uv run --project envs/sophistry_bench_sprint_env --extra dev pytest …) — the previous command silently skipped under the base venv. In the repo's shared CI it still skips via importorskip (no scoring dep installed), same as tbench2's camel guard.
step() now counts only scored steps (increment moved after the reset guard).
app.py: commented the load-bearing package-dir remap the container import relies on.

On the reward: float narrowing of base Observation.reward (Union[bool,int,float,None]): confirmed no construction path passes reward=None — serialize_observation strips reward and the client reconstructs from the float we emit — so the narrowing can't raise a ValidationError.

The two framework-level alignment notes (serialize_observation stripping metadata; whether the framework should preserve it so every typed env gets observation.metadata for free) remain maintainer calls — verified against current main, the strip still happens. cc @Darktex

acharyaanusha and others added 11 commits June 10, 2026 19:13

feat(sophistry_bench_sprint_env): scaffold OpenEnv package + vendor s…

ebf13e5

…print wheel Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat(sophistry_bench_sprint_env): add AdvocacyAction/AdvocacyObservat…

76ed010

…ion models Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat(sophistry_bench_sprint_env): typed HTTP client

21b7952

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat(sophistry_bench_sprint_env): environment construction + reset()

83465f0

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat(sophistry_bench_sprint_env): step() scoring with canonical rewar…

b6d0355

…d parity Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat(sophistry_bench_sprint_env): FastAPI app + Dockerfile

4ce6be8

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

docs(sophistry_bench_sprint_env): README + build/usage

d337a1d

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

fix(sophistry_bench_sprint_env): error path survives wire + serializa…

230269a

…tion regression test Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

docs+test(sophistry_bench_sprint_env): correct image tag; error wire-…

b419779

…survival test Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Darktex requested changes Jun 11, 2026

View reviewed changes

acharyaanusha requested a review from Darktex June 11, 2026 18:06

Darktex reviewed Jun 11, 2026

View reviewed changes

Darktex reviewed Jun 12, 2026

View reviewed changes

acharyaanusha requested a review from Darktex June 12, 2026 05:47

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add sophistry_bench_sprint_env: single-agent advocacy reward-hacking environment#787

Add sophistry_bench_sprint_env: single-agent advocacy reward-hacking environment#787
acharyaanusha wants to merge 13 commits into
huggingface:mainfrom
acharyaanusha:feature/sophistry_bench_sprint_env

acharyaanusha commented Jun 11, 2026 •

edited

Loading

Uh oh!

acharyaanusha commented Jun 11, 2026

Uh oh!

Darktex left a comment

Uh oh!

acharyaanusha commented Jun 11, 2026 •

edited

Loading

Uh oh!

Darktex left a comment

Uh oh!

Darktex left a comment

Uh oh!

acharyaanusha commented Jun 12, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

acharyaanusha commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

A serialization fix worth flagging for the framework

Test Plan

Dependency & live demo

Uh oh!

acharyaanusha commented Jun 11, 2026

Uh oh!

Darktex left a comment

Choose a reason for hiding this comment

Review: sophistry_bench_sprint_env

Tier 1 — Fixes required (CI-blocking)

Tier 1 — Smaller fixes

Tier 2 — Alignment / for human review

Claims verified against the code

Verdict

Uh oh!

acharyaanusha commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Darktex left a comment

Choose a reason for hiding this comment

Suggestions (non-blocking)

Alignment notes for a human reviewer

Uh oh!

Darktex left a comment

Choose a reason for hiding this comment

Tier 1 (Bugs & Lint) — non-blocking

Tier 2 (Alignment)

Uh oh!

acharyaanusha commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

acharyaanusha commented Jun 11, 2026 •

edited

Loading

Review: `sophistry_bench_sprint_env`

acharyaanusha commented Jun 11, 2026 •

edited

Loading

acharyaanusha commented Jun 12, 2026 •

edited

Loading