feat: scenarios v2.1 by tomaspiaggio · Pull Request #31 · Autonoma-AI/test-planner-plugin

tomaspiaggio · 2026-04-22T01:33:31Z

No description provided.

…or not

Delivers the 5-step pipeline, ack/pipeline-complete sentinels, factory negative example + integrity check, and web notifications support. Bump is needed so Claude Code's semver cache picks this over the stale 1.4.0 that currently wins. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Copies new files from Autonoma-AI/test-planner-plugin upstream main:

- hooks/preflight_scenario_recipes.py
- hooks/validators/validate_scenario_recipes.py
- hooks/validators/validate_discover.py
- hooks/validators/validate_sdk_endpoint.py
- hooks/validators/validate_scenario_validation.py
- tests/test_preflight_scenario_recipes.py (+ 4 other test files)

No conflicts since these are new files. All 119 tests pass. Agent prompts + SKILL.md wiring to use these will follow in separate commits.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Ports upstream's recipe subsystem into the fork's KB-first architecture (Step 5 scenario-validator is the home for recipe emission).

Agents
- scenario-generator.md: add entity-audit.md as authoritative schema source, scoping analysis, variable_fields + testRunId slugging strategy, nested tree constraint, expanded frontmatter (variable_fields, planning_sections).
- test-case-generator.md: require {{token}} placeholders for variable fields, prohibit meta-tests that "audit" fixture contents.
- scenario-validator.md: after all scenarios pass, emit autonoma/scenario-recipes.json (nested tree + variables), run preflight_scenario_recipes.py, write autonoma/.scenario-validation.json terminal artifact, then write the existing .endpoint-validated sentinel.

Orchestration (SKILL.md + commands/generate-tests.md)
- Step 5 enforces status=ok + preflightPassed=true, re-runs preflight at the orchestrator gate, and uploads the recipes to /v1/setup/setups/{generationId}/scenario-recipe-versions.
- Step 6 prompt parses variable_fields and enforces {{token}} usage.

Validators
- validate-pipeline-output.sh: route scenario-recipes.json and .scenario-validation.json to their validators.
- validate_scenarios.py: require variable_fields + planning_sections (skipping upstream's discover.json requirement since Step 3 runs before SDK integration in fork's flow).
- Test fixture updated accordingly.

All 119 tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

scenarios, implement, and e2e-tests were renamed on the docs site to match the actual 6-step plugin flow. Add a fetch for step-5-validate.txt so the scenario-validator subagent sees the live validation doc too.

Post-mortem of a real run showed the agent implementing defineFactory({ create: db.<model>.create() }) despite the existing "never reimplement inline" directive. Root cause: the prompt had 4 structural information gaps.

1. needs_extraction: true models had no actionable guidance: the agent saw the flag, skipped extraction, and did db.create().
2. No policy for external side effects (Temporal, GitHub, analytics, LLMs), so agents avoided real functions by bypassing them.
3. DI guidance only mentioned ctx.executor. Services needing logger, event bus, or multi-dep composition had no recipe.
4. The prohibition was phrased softly ("99% of cases") and not paired with a clear always-available alternative.

Fixes:
- Add a per-model decision tree keyed on needs_extraction that mandates extraction-before-wiring, with a concrete Better Auth example.
- Add an external side effects policy: preserve DB state (including sibling writes from ORM/framework hooks), not every network call.
- Add a 5-step DI / constructor-injection playbook covering top-level imports, static methods, simple instance methods, composition-root reuse, and "stop and ask" as the terminal branch; never db.create().
- Reword the prohibition as absolute: db.<model>.create() inside a factory for a has_creation_code: true model is NEVER acceptable.
- Extend the factory-integrity check to HALT if any needs_extraction flag remains in the audit.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Post-mortem of the vienna-v1 run (44 of 50 factories contained inline db.<model>.create despite the strengthened prompt) produced two direct action items from the failing agent itself:

1. The self-policed factory-integrity check is not enough. Past agents run out of context, hand off to continuation agents that focus on TypeScript errors, and never re-run the grep. The only mechanical fix is a hook-level validator that blocks the .endpoint-implemented sentinel when violations exist.
2. Agents make ONE bad decision and apply it uniformly. Forcing them to pause and document a per-model decision table (with "file opened? import path? DI deps? Branch 1/2/3?") surfaces the give-up moment that previously happened silently.

Changes:
- New validators/validate_endpoint_implemented.py parses entity-audit.md frontmatter, locates the handler via the sentinel body, extracts every defineFactory(...) block with a brace balancer, finds each factory's create() body, and greps for inline ORM writes (prisma./db./tx./ctx.executor. patterns plus Drizzle tx.insert()). Exits 2 with a detailed Claude-facing error when any has_creation_code true model's factory contains an anti-pattern, or when any such model has no factory at all. Verified against the vienna-v1 handler: flags all 44 violations.
- validate-pipeline-output.sh runs the validator on .endpoint-implemented writes and propagates exit 2, blocking the sentinel and forcing the agent to fix factories before advancing.
- env-factory-generator.md adds a mandatory "Research pass" section requiring autonoma/.factory-plan.md with a per-model table (Model / Audit function / File opened / Import path / DI deps / Decision branch / Notes) before any factory is written. Includes the exact grep patterns the agent itself said would have prevented the silent DI give-up, and a reminder of the external-side-effects policy covering Temporal, BetterAuth hooks, billing, and analytics.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
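
For readers unfamiliar with the approach, here is a minimal sketch of the "brace balancer plus grep" idea described in this commit. It is not the code in validate_endpoint_implemented.py: the regexes, function names, and sample handler are assumptions made for illustration, and the naive balancer ignores braces inside strings and comments.

```python
import re

# Hypothetical patterns approximating "inline ORM writes"; the validator's
# real regexes are not shown in the commit message.
INLINE_ORM = re.compile(
    r"\b(?:prisma|db|tx)\.\w+\.create\s*\("  # prisma.user.create( / db.user.create(
    r"|\bctx\.executor\."                     # direct executor access
    r"|\btx\.insert\s*\("                     # Drizzle-style insert
)


def extract_factory_blocks(source: str) -> list[str]:
    """Return the text of each defineFactory(...) call, found by bracket balancing."""
    blocks = []
    for match in re.finditer(r"\bdefineFactory\s*\(", source):
        depth = 0
        # Start scanning at the opening "(" of the call.
        for i in range(match.end() - 1, len(source)):
            if source[i] in "({[":
                depth += 1
            elif source[i] in ")}]":
                depth -= 1
                if depth == 0:
                    blocks.append(source[match.start(): i + 1])
                    break
    return blocks


def find_violations(source: str) -> list[str]:
    """Return the factory blocks that contain an inline ORM write."""
    return [b for b in extract_factory_blocks(source) if INLINE_ORM.search(b)]


# Made-up handler source: one factory inlines the ORM, one calls a service.
SAMPLE_HANDLER = """
const user = defineFactory({ create: async (data, ctx) => { return db.user.create({ data }); } });
const org = defineFactory({ create: async (data, ctx) => { return OrgService.create(data); } });
"""

violations = find_violations(SAMPLE_HANDLER)
```

A hook built on this would exit with status 2 when `violations` is non-empty, which is how the commit describes blocking the sentinel.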

Forces the plugin cache to re-populate so the new validator (hooks/validators/validate_endpoint_implemented.py) and the mandatory research-pass section in agents/env-factory-generator.md ship to every installed instance. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Third-run post-mortem surfaced two new failure modes the prompt alone cannot prevent:

1. Agent creates a standalone Hono server in apps/api/src/autonoma/ and binds its own port instead of mounting the handler on the main app.
2. When the factory-integrity check blocks .endpoint-implemented, the agent games the check by mass-flipping has_creation_code: true -> false in entity-audit.md so the grep has nothing to match.

Fixes:
- validate_endpoint_implemented.py gains check_handler_mount() that detects standalone serve()/listen()/createServer() in the handler's directory and verifies the handler is imported from outside its own directory.
- validate_endpoint_implemented.py gains check_audit_flip() that diffs the current audit against a Step 2 snapshot and fails if more than AUTONOMA_AUDIT_FLIP_CAP models (default 5) flipped from true -> false.
- validate-pipeline-output.sh snapshots entity-audit.md to .entity-audit-step2.md when .step-2-ack is written (i.e. the moment the user confirms the audit is accepted).

Tested against vienna-v1: all three checks (ORM inlining, standalone server, handler-not-mounted) fire with actionable error messages; the audit-flip check correctly detects 43/54 flips against a synthesized snapshot.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Run 4 bypass (fake extraction): agent moved bare ORM inserts into a helper file under the handler directory and rewrote the entity-audit's creation_file column to point at the helper. Every heuristic hook passed because the audit had been rewritten to match the stub.

Three-gate architecture replacing the single heuristic gate:

- Gate 1 (existing), cheap syntactic: grep for inline ORM, mount checks, audit-flip cap.
- Gate 2 (new), creation_file immutability: for every model with has_creation_code: true in both the Step 2 snapshot and the current audit, refuse any change to creation_file. Extractions must record extracted_to instead of overwriting ground truth. Catches the audit-rewrite attack at near-zero cost.
- Gate 3 (new), semantic LLM fan-out: fetches the factory-fidelity rubric from the quarita docs at runtime, spawns one `claude -p` subprocess per has_creation_code: true model (bounded concurrency), compiles per-criterion feedback, and blocks the sentinel on any hard failure. Reads code, not patterns, so factoring evasions do not help. Rubric lives in docs so it can be tuned without cutting a plugin release.

Env knobs: AUTONOMA_FIDELITY_CONCURRENCY (6), AUTONOMA_FIDELITY_TIMEOUT (180), AUTONOMA_FIDELITY_MAX_MODELS (60), AUTONOMA_FIDELITY_MODEL (sonnet), AUTONOMA_SKIP_FIDELITY (escape hatch).

Gate 3 fails soft on missing docs URL, missing claude CLI, or transient subprocess errors; the cheap gates remain the primary guarantee and the semantic check is an additive safety net.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
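
Gate 2 is simple enough to sketch in a few lines. The following is an illustration under assumptions, not the shipped validator: the audit is assumed to be already parsed into a dict of per-model records, and the function name and sample paths are hypothetical. Only the field names has_creation_code, creation_file, and extracted_to come from the commit message.

```python
def creation_file_violations(snapshot: dict[str, dict], current: dict[str, dict]) -> list[str]:
    """Models whose creation_file changed while has_creation_code stayed true."""
    violations = []
    for model, old in snapshot.items():
        new = current.get(model)
        if new is None:
            continue  # removed models are out of scope for this sketch
        if old.get("has_creation_code") and new.get("has_creation_code"):
            if old.get("creation_file") != new.get("creation_file"):
                violations.append(model)
    return sorted(violations)


# Made-up records for one model.
step2 = {"User": {"has_creation_code": True, "creation_file": "src/services/user.ts"}}

# Honest extraction: ground truth untouched, new location recorded separately.
honest = {"User": {"has_creation_code": True,
                   "creation_file": "src/services/user.ts",
                   "extracted_to": "src/services/create-user.ts"}}

# Audit-rewrite attack: creation_file repointed at a stub helper.
rewritten = {"User": {"has_creation_code": True,
                      "creation_file": "src/autonoma/helpers/user.ts"}}

honest_result = creation_file_violations(step2, honest)
rewritten_result = creation_file_violations(step2, rewritten)
```

Because the check is a plain comparison against the Step 2 snapshot, it costs almost nothing to run, which matches the "near-zero cost" claim above.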

When Branch 1 lifts inline creation logic into a new exported function, the developers who later encounter it should be able to tell at a glance that it was extracted for Environment Factory reuse — not invented for it. Ask the agent to leave a 1–2 line comment above the new export explaining the why and pointing at autonoma/entity-audit.md. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Script + fixture set that exercises the live rubric (fetched from AUTONOMA_DOCS_URL) against known-good and known-bad factory shapes. Each fixture declares the expected verdict and, for failures, which criteria should fail. Mismatch is a hard error.

Fixtures (all generic, no codebase-specific names):
- good_uses_service: factory calls <Model>Service.create (should pass)
- good_thin_wrapper_after_extraction: Branch 1 extraction preserves every side effect (should pass)
- bad_raw_orm_in_factory: factory body is db.<model>.create (should fail criteria 1 + 2)
- bad_stub_helper_in_handler_dir: helper file in factory dir is a raw insert with "no business logic" comment (should fail criteria 1, 2, 4)
- bad_audit_rewrite_only: thin wrapper keeps side effects, but current audit's creation_file was repointed at the helper (should fail criterion 3 only)

Run via `AUTONOMA_DOCS_URL=... python3 hooks/validators/evals/run_evals.py`. Full suite confirmed against the rewritten rubric: 5/5 pass.

Useful when tuning the rubric: change rubric wording → rerun → see which verdicts flipped. Easier than hunting regressions in live runs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
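
The "expected verdict plus expected failing criteria" comparison might look roughly like the sketch below. The fixture keys (expected_verdict, expected_failed_criteria) and the function name are assumptions, since the fixture schema is not shown here, and the sketch assumes the failing criteria must match exactly rather than as a subset.

```python
def check_fixture(fixture: dict, verdict: str, failed_criteria: list[int]) -> list[str]:
    """Compare a judged result against a fixture; any returned problem is a hard error."""
    problems = []
    expected = fixture["expected_verdict"]
    if verdict != expected:
        problems.append(f"verdict: expected {expected}, got {verdict}")
    if expected == "fail":
        want = set(fixture.get("expected_failed_criteria", []))
        got = set(failed_criteria)
        if want != got:
            problems.append(f"criteria: expected {sorted(want)}, got {sorted(got)}")
    return problems


# Hypothetical fixtures mirroring two entries from the list above.
bad_raw_orm = {"expected_verdict": "fail", "expected_failed_criteria": [1, 2]}
good_service = {"expected_verdict": "pass"}

match = check_fixture(bad_raw_orm, "fail", [1, 2])
partial = check_fixture(bad_raw_orm, "fail", [1])
regressed = check_fixture(good_service, "fail", [1])
```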

Replaces the binary has_creation_code field with two orthogonal fields so the audit can describe every way a model comes into existence:

- independently_created: bool (standalone creation path exists)
- created_by: [{owner, via, why}] (owners that mint this model inline)

Four states: pure root, dual (both true), pure dependent, invalid. Binary classification collapsed dual models (have their own service AND are minted inline by a parent transaction), forcing the downstream pipeline to either fabricate a factory for a dependent or ignore the standalone path.

Changes:
- agents/entity-audit-generator.md: two-pass methodology + 4-state matrix
- agents/scenario-generator.md: per-model standalone-vs-via-owner choice
- agents/env-factory-generator.md: teardown decision tree + compat note
- hooks/validators/_audit_schema.py (new): shared compat shim
- hooks/validators/validate_{factory_fidelity,creation_file_immutable,entity_audit}.py: filter via is_independently_created; enforce created_by invariants
- hooks/validators/evals/fixtures: dependent_skipped, dual_judged_on_standalone, bad_missing_owner
- hooks/validators/evals/run_evals.py: handle skip verdict + audit-validator fixtures (no LLM call needed)

Backwards-compat shim translates legacy has_creation_code on read, so existing audits keep working. Plugin version 1.12.0.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
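
The four-state matrix and the read-time compat shim can be sketched as follows. This is an illustration, not the contents of _audit_schema.py: the function names are hypothetical, and mapping legacy has_creation_code: true to independently_created: true is an assumption about how the shim translates.

```python
def read_independently_created(entry: dict) -> bool:
    """Prefer the new field; fall back to the legacy flag (assumed mapping)."""
    if "independently_created" in entry:
        return bool(entry["independently_created"])
    return bool(entry.get("has_creation_code", False))


def classify(entry: dict) -> str:
    """Map an audit entry onto the four states named in the commit message."""
    independent = read_independently_created(entry)
    has_owner = bool(entry.get("created_by"))
    if independent and has_owner:
        return "dual"
    if independent:
        return "pure root"
    if has_owner:
        return "pure dependent"
    return "invalid"


# Made-up owner record using the {owner, via, why} shape from the commit message.
owner = {"owner": "Order", "via": "createOrder", "why": "minted in parent transaction"}

states = [
    classify({"independently_created": True, "created_by": []}),
    classify({"independently_created": True, "created_by": [owner]}),
    classify({"independently_created": False, "created_by": [owner]}),
    classify({"independently_created": False, "created_by": []}),
    classify({"has_creation_code": True}),  # legacy entry read through the shim
]
```

Under this sketch a fidelity check would only judge entries where `read_independently_created` is true, which is one reading of the dependent_skipped fixture name.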

Vienna-v1 run-through surfaced the env-factory agent oscillating 11+ times between extract-helper (validator: "helper not provided") and raw-write (validator: "use the Step 2 creation_function"). Two root causes:

1. find_helper() only detected the FIRST top-level return/await call and resolved imports via simple suffix guessing, so it missed wrapper patterns like `const r = await fn(...); return r.session;` and TS path aliases. Replaced with find_helpers() that collects every identifier called in the factory block, filters via the handler's named imports, and resolves through tsconfig.json compilerOptions.paths.
2. The rubric's Criterion 1 demands the factory "reach the Step 2 creation_function", which is impossible when that function is a framework hook (Better-Auth databaseHooks, NextAuth callbacks, inline route closures) that only fires via the framework's own entry point. Added an explicit carve-out: when Step 2 records `needs_extraction: true`, Criterion 1 passes iff the factory calls `extracted_to` and that function preserves the hook's call chain.

Also relaxed validate_entity_audit.py: factory_count drift is now autofixed with a stderr warning instead of blocking the pipeline (previously caused a 6-cycle loop in vienna-v1 where the agent oscillated the count field).

New prompt template placeholders {{NEEDS_EXTRACTION}} / {{EXTRACTED_TO}} and a new `error` verdict distinguish genuine missing context from true failures.

Three new eval fixtures cover the carve-out:
- framework_hook_extraction_pass.json (expected pass)
- framework_hook_raw_write_fail.json (expected fail, Criteria 1 + 4)
- helper_unresolvable_errors.json (expected error, not fail)

Bump plugin to 1.13.0.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
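
To illustrate the alias-resolution half of root cause 1, here is a minimal sketch of mapping an import specifier through a `paths` table. It is not find_helpers(): the function name and sample aliases are hypothetical, the table is passed in as a dict rather than read from tsconfig.json (which may contain comments that `json.loads` rejects), and unlike the TypeScript compiler the sketch returns candidates from every matching pattern instead of preferring the longest prefix.

```python
import posixpath


def resolve_alias(specifier: str, paths: dict[str, list[str]], base_url: str = ".") -> list[str]:
    """Return candidate paths for `specifier` using tsconfig-style `paths` patterns."""
    candidates = []
    for pattern, targets in paths.items():
        if "*" in pattern:
            prefix, suffix = pattern.split("*", 1)
            long_enough = len(specifier) >= len(prefix) + len(suffix)
            if long_enough and specifier.startswith(prefix) and specifier.endswith(suffix):
                # The text captured by "*" is substituted into each target.
                star = specifier[len(prefix): len(specifier) - len(suffix)]
                for target in targets:
                    joined = posixpath.join(base_url, target.replace("*", star, 1))
                    candidates.append(posixpath.normpath(joined))
        elif specifier == pattern:
            for target in targets:
                candidates.append(posixpath.normpath(posixpath.join(base_url, target)))
    return candidates


# Made-up alias table of the kind found under compilerOptions.paths.
PATHS = {"@/*": ["src/*"], "@db": ["packages/db/index.ts"]}

aliased = resolve_alias("@/lib/auth", PATHS)
exact = resolve_alias("@db", PATHS)
relative = resolve_alias("./helper", PATHS)
```

An empty result, as for the relative specifier, would mean the import is not an alias and should be resolved relative to the importing file instead.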

…f removing Branch 1 instructions previously told the agent to REMOVE needs_extraction: true after extracting. The v1.13.0 fidelity rubric's framework-hook carve-out reads both needs_extraction and extracted_to to score factories against the helper rather than the un-callable hook, so the field must stay set. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

… agents/commands/skills Scenario-validator agent, generate-tests command/skill, and validate-pipeline-output.sh comments still referenced the legacy audit field. Eval fixtures and validator code that intentionally exercise the v1 legacy schema (compat shim) are left untouched. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

snake_to_pascal auto-derives model names without pluralization. The agent was populating tableNameMap as a 1:1 mirror of the factory registry, doubling the maintenance surface. New section 4b gives a 5-step algorithm: only add entries where the factory key disagrees with the auto-derived name, and omit the map entirely when it would be empty. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

Reconciles this branch's Scenarios v2.1 + env-factory hardening work with upstream's two additive changes:

1. chiara-ciriani/adhoc-planner (generate-adhoc-tests command)
2. PR Autonoma-AI#28 "SDK integration step & hardened pipeline"

Resolution strategy:
- Ours wins for every file that belongs to the existing pipeline (env-factory-generator, scenario-generator, scenario-validator, test-case-generator, generate-tests command/skill, validate-pipeline-output.sh, validate_scenarios.py and its test). This branch updated each of them as part of v2.1; upstream's parallel edits are superseded.
- Upstream wins for the additive adhoc-tests surface, which is independent of the main pipeline and happens to spawn the existing `scenario-generator` and `env-factory-generator` subagents by name:
  * agents/focused-test-case-generator.md
  * commands/generate-adhoc-tests.md
  * skills/generate-adhoc-tests/SKILL.md
- Upstream's env-factory replacement (sdk-integrator.md + validate_sdk_integration.py + test_validate_sdk_integration.py) is dropped. The adhoc command explicitly calls the legacy `env-factory-generator` agent, so reintroducing sdk-integrator here would create two parallel agents for the same role.
- tests/test_validate_pipeline_output.py was written against upstream's 135-line hook; our hook is a 318-line superset with lifecycle emission, so the upstream test is dropped rather than kept as a false signal.
- plugin.json pinned at 1.13.1 (ours); release-please config from upstream is kept.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

tomaspiaggio and others added 30 commits April 15, 2026 12:30

feat: sceanrios v2.1

9763c8f

feat: added a 2nd step that validates if we need to use repositories …

f1eb5ff

…or not

feat: added validator for step 2

ddd1991

feat: made docs env to reduce hallucinations when running locally

48e7feb

feat: changed the docs to be fully env

4bd608f

feat: now the agent doesn't send the updates. now it's a hook

ad2e1c6

feat: added canary deployments

ddae568

feat: plugin streaming logs

f0fa4fd

fix: annotations

3cf2f39

feat: debugging stream

e5b4d87

fix: pretool-heartbeat

353725e

fix: stream

58d0076

feat: validation on implementation

ec379bc

fix: added hook at the end to finish the pipeline

81e0def

fix: new step for validation

2b512a2

feat: better breaks for scenarios v2.1 implementations

fc7004d

docs: point orchestrator fetches at renamed 6-step doc slugs

8a33115


tomaspiaggio and others added 2 commits April 20, 2026 18:29

claude Bot reviewed Apr 22, 2026


sponja23 changed the title ~~feat: sceanrios v2.1~~ feat: scenarios v2.1 Apr 22, 2026

tomaspiaggio merged commit 85719b5 into Autonoma-AI:main Apr 23, 2026
1 check passed

tomaspiaggio mentioned this pull request Apr 23, 2026

release: promote main → production (v1.14.0) #35

Merged

3 tasks

tomaspiaggio merged 33 commits into Autonoma-AI:main from tomaspiaggio:main
