Kass/test cases batch 2#695
Conversation
Three orthogonal additions that together let the unified runner host tests over real conversational data: - `AddMessagesFromFixtureAction` -- ingest messages from a fixture JSON under tests/unified/test_cases/data/ rather than inline. Maps fixture role=user / role=assistant to configured peer IDs; preserves per- message timestamps when present. - `continue_on_failure` flag on TestDefinition -- when set, per-step assertion failures collect rather than abort, so suites with N independent scenarios aggregate per-session pass/fail. Infrastructure errors still abort. - `SaveArtifactAction` -- write a query result (get_representation / get_peer_card / get_context) to a named JSON file under a per-test artifact dir. No assertion coupling, so offline analysis is decoupled from the pass/fail gate. Per-test artifact dir is at tests/unified/artifacts/<test_name>_<utc_ts>/ (override with UNIFIED_TEST_ARTIFACTS_DIR; gitignored). Each test run also writes _run_metadata.json capturing honcho git SHA + dirty status, transport + model for each LLM-using agent, SHA-256 hashes of relevant prompt source files, judge model, and timestamp. Manual snapshot-replay procedure documented in tests/unified/README.md; automatic replay deferred.
L1 reproducer for the deriver_no_memory_signal failure mode: the
deriver writes observations that don't add signal about the user or
their work -- bare acknowledgments paraphrased as facts ("user said
yes"), shell-history fragments ("user executed cd ~/foo"), instruction-
event-records that look like durable user attributes but aren't.
Inputs: 8 sessions sampled from SALT-NLP/SWE-chat (ODC-BY,
https://arxiv.org/abs/2604.20779), persona-balanced across Expert
Nitpicker / Vague Requester / Mind Changer. Two sessions from the
underlying extract are excluded for data quality (one Japanese with
a heavily skewed user/assistant role split; one near-monologue);
documented in the fixture description.
Per session: ingest messages, wait for queue empty, run an llm_judge
asking whether the user's representation contains any Tier 1 bare-
speech-act observation, then save_artifact-dump the representation
for offline analysis. The suite uses continue_on_failure to aggregate
per-session pass/fail. Judge prompt is calibrated with annotated
example observations across three tiers (Tier 1 / Tier 2 / Signal) to
distinguish bare-event records from specifics-anchored observations.
Inputs file at tests/unified/test_cases/data/swe_chat_sample_persona_balanced.json
-- the committed dataset extract. Metadata embedded in the file
documents its provenance (SALT-NLP/SWE-chat, ODC-BY, selection
criteria, extraction timestamp).
L1 reproducer for the dreamer's failure to consolidate duplicate
observations. The dreamer's documented role is consolidation:
rewriting / merging semantically related observations into better-
stated ones, and deleting redundant observations. This fixture
asserts the consolidation specifically against duplicates --
exact-text duplicates, near-duplicates (varying only in trivial
details), and semantic duplicates (the same fact stated from
different angles).
Inputs: 3 sessions sampled from SALT-NLP/SWE-chat (indices 5, 6, 7
from swe_chat_sample_persona_balanced.json). Each session is
ingested through the deriver, then dreamed, then queried for
residual duplicates. Two-gate structure per session:
- PRE-DREAM SETUP GATE: assert >=2 duplicate observations exist
before the dream is scheduled. If the deriver did not produce
duplicates, the test errors at the setup gate rather than
silently passing the post-dream gate on no inputs.
- POST-DREAM MAIN GATE (strict v1): assert ZERO duplicate
observations remain after the dream. Strict gate; relative
thresholds (post-dream count < pre-dream count) could be added
later via runner-level cross-step state.
Judge prompts are calibrated with three flavors of duplicate
patterns (exact-content duplicates, near-twins of a single shell
command differing only by embedded timestamp, and opposite-polarity
semantic twins describing the same convention).
…ixture L1 reproducer for the failure mode in the wiki failure-modes catalog: the agent misinterprets the user's statement; the user explicitly corrects; both the wrong claim (from the agent's misread) and the corrected claim end up in storage with no supersession marker. End- state shape resembles time_drift_storage_no_supersession but the trigger is agent misinterpretation rather than user mind-change, and the wrong claim is one the user never literally stated. Setup: user_a clearly scopes their work to extracting a rate-limiter subsystem from a Python payment-service monolith into a standalone Go microservice. Assistant misreads as "full payment service migration to Go." User immediately corrects and reaffirms the narrow scope. Subsequent turns develop substantive technical content (gRPC RPCs, Redis token-bucket, Lua atomicity, circuit breaker trade-offs, testing strategy). The wrong claim never appears in a user turn -- only in the assistant's misread. Two-gate structure: positive control asserts the representation captures the correct-scope work; negative reproducer asserts the representation does NOT contain the assistant's misinterpretation (claims about migrating the whole payment service or rewriting the Python monolith). Both gates start at pass_if=true (invariant); per verify-first methodology, polarity flips to pass_if=false only after observed firing on real code.
WalkthroughThis PR introduces end-to-end artifact capture and failure-mode reproducer test infrastructure. It extends the unified test schema with fixture-loading and error-continuation capabilities, implements metadata/artifact serialization in the executor, adds new step handlers, refines ChangesUnified Test Infrastructure and Fixtures
🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (1)
.gitignore (1)
10-12: ⚡ Quick winUnignore fixture contents explicitly, not just the directory.
Given the broad
dataignore rule, add an explicit recursive unignore so new fixture files under this directory are always tracked.✅ Suggested fix
!tests/unified/test_cases/data/ +!tests/unified/test_cases/data/**🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.gitignore around lines 10 - 12, The .gitignore currently ignores the top-level data directory but only unignores the directory path '!tests/unified/test_cases/data/' which does not unignore files inside; update the unignore to be recursive so fixture files are tracked (i.e. replace or add a recursive unignore for tests/unified/test_cases/data such that files under that directory are not ignored while keeping the broader 'data' and 'redis-data' ignore rules intact).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/unified/runner.py`:
- Around line 514-520: The code currently builds fixture_path from FIXTURES_DIR
/ step.fixture_path and opens it directly, allowing path traversal (e.g., "../")
to escape the fixture root; fix by resolving the combined path (use
Path.resolve(strict=False) on FIXTURES_DIR / step.fixture_path) and verify the
resolved path is a child of FIXTURES_DIR.resolve() (e.g., compare parents or
string prefix) and raise FileNotFoundError or ValueError if it is outside the
fixture root before proceeding to open the file (references: FIXTURES_DIR,
step.fixture_path, fixture_path, and the with open(...) call).
- Around line 523-531: The runtime should reject negative fixture indices and
non-positive limits: before using step.fixture_session_index against
sessions_in_fixture, add a check that step.fixture_session_index is >= 0 and
raise an IndexError with a clear message if not (currently only upper bound is
checked); similarly, before slicing fixture_messages using step.limit, validate
that if step.limit is not None then step.limit > 0 and raise a ValueError (or
appropriate exception) for limit <= 0 with a descriptive message; update the
checks around fixture_session = sessions_in_fixture[...] and fixture_messages =
fixture_session.get("messages", []) to enforce these validations.
- Around line 346-347: Three f-strings are implicitly concatenated and must be
joined explicitly: replace the adjacent f-strings in the message built with
test_name and assertion_failures so they are joined with +; similarly join the
two adjacent f-strings that reference fixture_path, step.fixture_path, and
FIXTURES_DIR with +; and join the two adjacent f-strings that reference
step.fixture_session_index and sessions_in_fixture (len(sessions_in_fixture))
with +. Locate these by the symbols test_name, assertion_failures, fixture_path,
FIXTURES_DIR, step.fixture_path, step.fixture_session_index, and
sessions_in_fixture in tests/unified/runner.py and change the implicit
concatenations to use the + operator to form a single string expression.
In `@tests/unified/schema.py`:
- Around line 83-95: The fixture_session_index and limit fields currently allow
negative/zero values; update their Pydantic Field declarations to enforce
bounds: change fixture_session_index to require ge=0 (non-negative integer) and
change limit to require ge=1 (positive integer) by adding the appropriate Field
constraints (e.g., Field(..., ge=0, description=...) for fixture_session_index
and Field(None, ge=1, description=...) for limit) so invalid indices or limits
are rejected at validation time; ensure you keep the existing descriptions on
the Field calls.
---
Nitpick comments:
In @.gitignore:
- Around line 10-12: The .gitignore currently ignores the top-level data
directory but only unignores the directory path
'!tests/unified/test_cases/data/' which does not unignore files inside; update
the unignore to be recursive so fixture files are tracked (i.e. replace or add a
recursive unignore for tests/unified/test_cases/data such that files under that
directory are not ignored while keeping the broader 'data' and 'redis-data'
ignore rules intact).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 798cb782-e304-4602-90f0-17df0aeb026e
📒 Files selected for processing (8)
.gitignoretests/unified/README.mdtests/unified/runner.pytests/unified/schema.pytests/unified/test_cases/data/swe_chat_sample_persona_balanced.jsontests/unified/test_cases/deriver_no_memory_signal.jsontests/unified/test_cases/dreamer_no_dedup.jsontests/unified/test_cases/misinterpretation_correction_creates_orphan_fact.json
| f"Test {test_name} FAILED with {len(assertion_failures)} " | ||
| f"assertion failure(s) (continue_on_failure=true):" |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify the exact warning locations and then recheck after patching.
rg -n "assertion failure\\(s\\)|resolved from|out of range" tests/unified/runner.pyRepository: plastic-labs/honcho
Length of output: 304
🏁 Script executed:
sed -n '345,348p' tests/unified/runner.pyRepository: plastic-labs/honcho
Length of output: 246
🏁 Script executed:
sed -n '516,519p' tests/unified/runner.pyRepository: plastic-labs/honcho
Length of output: 283
🏁 Script executed:
sed -n '524,527p' tests/unified/runner.pyRepository: plastic-labs/honcho
Length of output: 276
Fix implicit string concatenation at all three locations to comply with basedpyright's reportImplicitStringConcatenation check.
These lines use adjacent f-strings without explicit operators, which triggers CI warnings:
- Lines 346-347:
f"Test {test_name} FAILED with {len(assertion_failures)} " f"assertion failure(s) (continue_on_failure=true):" - Lines 517-518:
f"Fixture not found: {fixture_path} (resolved from " f"fixture_path={step.fixture_path!r}, FIXTURES_DIR={FIXTURES_DIR})" - Lines 525-526:
f"fixture_session_index={step.fixture_session_index} out of range " f"(fixture has {len(sessions_in_fixture)} sessions)"
Use explicit + operator to join the strings.
🧰 Tools
🪛 GitHub Actions: Static Analysis / 0_basedpyright.txt
[warning] 346-346: basedpyright warning: Implicit string concatenation not allowed (reportImplicitStringConcatenation)
🪛 GitHub Actions: Static Analysis / basedpyright
[warning] 346-346: basedpyright: Implicit string concatenation not allowed (reportImplicitStringConcatenation)
🪛 GitHub Check: basedpyright
[warning] 346-346:
Implicit string concatenation not allowed (reportImplicitStringConcatenation)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/unified/runner.py` around lines 346 - 347, Three f-strings are
implicitly concatenated and must be joined explicitly: replace the adjacent
f-strings in the message built with test_name and assertion_failures so they are
joined with +; similarly join the two adjacent f-strings that reference
fixture_path, step.fixture_path, and FIXTURES_DIR with +; and join the two
adjacent f-strings that reference step.fixture_session_index and
sessions_in_fixture (len(sessions_in_fixture)) with +. Locate these by the
symbols test_name, assertion_failures, fixture_path, FIXTURES_DIR,
step.fixture_path, step.fixture_session_index, and sessions_in_fixture in
tests/unified/runner.py and change the implicit concatenations to use the +
operator to form a single string expression.
| fixture_path = FIXTURES_DIR / step.fixture_path | ||
| if not fixture_path.is_file(): | ||
| raise FileNotFoundError( | ||
| f"Fixture not found: {fixture_path} (resolved from " | ||
| f"fixture_path={step.fixture_path!r}, FIXTURES_DIR={FIXTURES_DIR})" | ||
| ) | ||
| with open(fixture_path) as f: |
There was a problem hiding this comment.
Validate fixture paths stay under tests/unified/test_cases/data.
fixture_path is concatenated directly with FIXTURES_DIR, so values like ../... can escape the fixture root and read arbitrary files.
✅ Suggested fix
elif isinstance(step, AddMessagesFromFixtureAction):
- fixture_path = FIXTURES_DIR / step.fixture_path
+ fixture_path = (FIXTURES_DIR / step.fixture_path).resolve()
+ fixtures_root = FIXTURES_DIR.resolve()
+ if fixtures_root not in fixture_path.parents:
+ raise ValueError(
+ f"Invalid fixture_path {step.fixture_path!r}: must resolve under {fixtures_root}"
+ )
if not fixture_path.is_file():
raise FileNotFoundError(🧰 Tools
🪛 GitHub Actions: Static Analysis / 0_basedpyright.txt
[warning] 517-517: basedpyright warning: Implicit string concatenation not allowed (reportImplicitStringConcatenation)
🪛 GitHub Actions: Static Analysis / basedpyright
[warning] 517-517: basedpyright: Implicit string concatenation not allowed (reportImplicitStringConcatenation)
🪛 GitHub Check: basedpyright
[warning] 517-517:
Implicit string concatenation not allowed (reportImplicitStringConcatenation)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/unified/runner.py` around lines 514 - 520, The code currently builds
fixture_path from FIXTURES_DIR / step.fixture_path and opens it directly,
allowing path traversal (e.g., "../") to escape the fixture root; fix by
resolving the combined path (use Path.resolve(strict=False) on FIXTURES_DIR /
step.fixture_path) and verify the resolved path is a child of
FIXTURES_DIR.resolve() (e.g., compare parents or string prefix) and raise
FileNotFoundError or ValueError if it is outside the fixture root before
proceeding to open the file (references: FIXTURES_DIR, step.fixture_path,
fixture_path, and the with open(...) call).
| if step.fixture_session_index >= len(sessions_in_fixture): | ||
| raise IndexError( | ||
| f"fixture_session_index={step.fixture_session_index} out of range " | ||
| f"(fixture has {len(sessions_in_fixture)} sessions)" | ||
| ) | ||
| fixture_session = sessions_in_fixture[step.fixture_session_index] | ||
| fixture_messages = fixture_session.get("messages", []) | ||
| if step.limit is not None: | ||
| fixture_messages = fixture_messages[: step.limit] |
There was a problem hiding this comment.
Reject negative fixture indices and non-positive limits at runtime.
Line 523 only validates the upper bound; negative fixture_session_index still loads from the end. Line 530 allows limit <= 0, which yields surprising slices.
✅ Suggested fix
sessions_in_fixture = fixture.get("sessions", [])
+ if step.fixture_session_index < 0:
+ raise IndexError(
+ f"fixture_session_index must be >= 0; got {step.fixture_session_index}"
+ )
if step.fixture_session_index >= len(sessions_in_fixture):
raise IndexError(
@@
fixture_messages = fixture_session.get("messages", [])
if step.limit is not None:
+ if step.limit <= 0:
+ raise ValueError(f"limit must be > 0; got {step.limit}")
fixture_messages = fixture_messages[: step.limit]🧰 Tools
🪛 GitHub Actions: Static Analysis / 0_basedpyright.txt
[warning] 525-525: basedpyright warning: Implicit string concatenation not allowed (reportImplicitStringConcatenation)
🪛 GitHub Actions: Static Analysis / basedpyright
[warning] 525-525: basedpyright: Implicit string concatenation not allowed (reportImplicitStringConcatenation)
🪛 GitHub Check: basedpyright
[warning] 525-525:
Implicit string concatenation not allowed (reportImplicitStringConcatenation)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/unified/runner.py` around lines 523 - 531, The runtime should reject
negative fixture indices and non-positive limits: before using
step.fixture_session_index against sessions_in_fixture, add a check that
step.fixture_session_index is >= 0 and raise an IndexError with a clear message
if not (currently only upper bound is checked); similarly, before slicing
fixture_messages using step.limit, validate that if step.limit is not None then
step.limit > 0 and raise a ValueError (or appropriate exception) for limit <= 0
with a descriptive message; update the checks around fixture_session =
sessions_in_fixture[...] and fixture_messages = fixture_session.get("messages",
[]) to enforce these validations.
| fixture_session_index: int = Field( | ||
| ..., description="Index into fixture['sessions'] selecting which session to load" | ||
| ) | ||
| user_peer_id: str = Field( | ||
| ..., description="Honcho peer ID to attribute fixture role='user' messages to" | ||
| ) | ||
| assistant_peer_id: str | None = Field( | ||
| None, | ||
| description="Honcho peer ID for fixture role='assistant' messages. If omitted, assistant messages are skipped.", | ||
| ) | ||
| limit: int | None = Field( | ||
| None, description="Cap on number of conversational turns to ingest" | ||
| ) |
There was a problem hiding this comment.
Add bounds to fixture index and limit fields.
fixture_session_index and limit accept negative/non-positive values right now. At runtime, this can silently select unintended sessions or truncate turns unexpectedly.
✅ Suggested fix
class AddMessagesFromFixtureAction(TestStep):
@@
- fixture_session_index: int = Field(
- ..., description="Index into fixture['sessions'] selecting which session to load"
- )
+ fixture_session_index: int = Field(
+ ...,
+ ge=0,
+ description="Index into fixture['sessions'] selecting which session to load",
+ )
@@
- limit: int | None = Field(
- None, description="Cap on number of conversational turns to ingest"
- )
+ limit: int | None = Field(
+ None, gt=0, description="Cap on number of conversational turns to ingest"
+ )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/unified/schema.py` around lines 83 - 95, The fixture_session_index and
limit fields currently allow negative/zero values; update their Pydantic Field
declarations to enforce bounds: change fixture_session_index to require ge=0
(non-negative integer) and change limit to require ge=1 (positive integer) by
adding the appropriate Field constraints (e.g., Field(..., ge=0,
description=...) for fixture_session_index and Field(None, ge=1,
description=...) for limit) so invalid indices or limits are rejected at
validation time; ensure you keep the existing descriptions on the Field calls.
Three L1 failure-mode reproducers for the unified runner (DEV-1698):
misinterpretation_correction_creates_orphan_fact— syntheticderiver_no_memory_signal— 8 SWE-chat sessionsdreamer_no_dedup— 3 SWE-chat sessionsRunner additions:
AddMessagesFromFixtureAction,SaveArtifactAction,continue_on_failure, and_run_metadata.jsonfor replay (procedure intests/unified/README.md).All gates start at
pass_if=trueper verify-first methodology. SWE-chat sample underdata/is SALT-NLP/SWE-chat (ODC-BY); attribution embedded in the file.Summary by CodeRabbit
Tests
Documentation
Chores