feat: scenarios v2.1#31
Merged
Merged
Conversation
Delivers the 5-step pipeline, ack/pipeline-complete sentinels, factory negative example + integrity check, and web notifications support. Bump is needed so Claude Code's semver cache picks this over the stale 1.4.0 that currently wins. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copies new files from Autonoma-AI/test-planner-plugin upstream main: - hooks/preflight_scenario_recipes.py - hooks/validators/validate_scenario_recipes.py - hooks/validators/validate_discover.py - hooks/validators/validate_sdk_endpoint.py - hooks/validators/validate_scenario_validation.py - tests/test_preflight_scenario_recipes.py (+ 4 other test files) No conflicts since these are new files. All 119 tests pass. Agent prompts + SKILL.md wiring to use these will follow in separate commits. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Ports upstream's recipe subsystem into the fork's KB-first architecture
(Step 5 scenario-validator is the home for recipe emission).
Agents
- scenario-generator.md: add entity-audit.md as authoritative schema source,
scoping analysis, variable_fields + testRunId slugging strategy, nested
tree constraint, expanded frontmatter (variable_fields, planning_sections).
- test-case-generator.md: require {{token}} placeholders for variable fields,
prohibit meta-tests that "audit" fixture contents.
- scenario-validator.md: after all scenarios pass, emit
autonoma/scenario-recipes.json (nested tree + variables), run
preflight_scenario_recipes.py, write autonoma/.scenario-validation.json
terminal artifact, then write the existing .endpoint-validated sentinel.
Orchestration (SKILL.md + commands/generate-tests.md)
- Step 5 enforces status=ok + preflightPassed=true, re-runs preflight at the
orchestrator gate, and uploads the recipes to
/v1/setup/setups/{generationId}/scenario-recipe-versions.
- Step 6 prompt parses variable_fields and enforces {{token}} usage.
Validators
- validate-pipeline-output.sh: route scenario-recipes.json and
.scenario-validation.json to their validators.
- validate_scenarios.py: require variable_fields + planning_sections
(skipping upstream's discover.json requirement since Step 3 runs before
SDK integration in fork's flow).
- Test fixture updated accordingly.
All 119 tests pass.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
scenarios, implement, and e2e-tests were renamed on the docs site to match the actual 6-step plugin flow. Add a fetch for step-5-validate.txt so the scenario-validator subagent sees the live validation doc too.
Post-mortem of a real run showed the agent implementing
defineFactory({ create: db.<model>.create() }) despite the existing
"never reimplement inline" directive. Root cause: the prompt had 4
structural information gaps.
1. needs_extraction: true models had no actionable guidance — the agent
saw the flag, skipped extraction, and did db.create().
2. No policy for external side effects (Temporal, GitHub, analytics,
LLMs) — agents avoided real functions by bypassing them.
3. DI guidance only mentioned ctx.executor. Services needing logger,
event bus, or multi-dep composition had no recipe.
4. The prohibition was phrased softly ("99% of cases") and not paired
with a clear always-available alternative.
Fixes:
- Add a per-model decision tree keyed on needs_extraction that mandates
extraction-before-wiring, with a concrete Better Auth example.
- Add an external side effects policy: preserve DB state (including
sibling writes from ORM/framework hooks), not every network call.
- Add a 5-step DI / constructor-injection playbook covering top-level
imports, static methods, simple instance methods, composition-root
reuse, and "stop and ask" as the terminal branch — never db.create().
- Reword the prohibition as absolute: db.<model>.create() inside a
factory for a has_creation_code: true model is NEVER acceptable.
- Extend the factory-integrity check to HALT if any needs_extraction
flag remains in the audit.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Post-mortem of the vienna-v1 run (44 of 50 factories contained inline db.<model>.create despite the strengthened prompt) produced two direct action items from the failing agent itself: 1. The self-policed factory-integrity check is not enough. Past agents run out of context, hand off to continuation agents that focus on TypeScript errors, and never re-run the grep. The only mechanical fix is a hook-level validator that blocks the .endpoint-implemented sentinel when violations exist. 2. Agents make ONE bad decision and apply it uniformly. Forcing them to pause and document a per-model decision table (with "file opened? import path? DI deps? Branch 1/2/3?") surfaces the give-up moment that previously happened silently. Changes: - New validators/validate_endpoint_implemented.py parses entity-audit.md frontmatter, locates the handler via the sentinel body, extracts every defineFactory(...) block with a brace balancer, finds each factory's create() body, and greps for inline ORM writes (prisma./db./tx./ctx.executor. patterns plus Drizzle tx.insert()). Exits 2 with a detailed Claude-facing error when any has_creation_code true model's factory contains an anti-pattern, or when any such model has no factory at all. Verified against the vienna-v1 handler: flags all 44 violations. - validate-pipeline-output.sh runs the validator on .endpoint-implemented writes and propagates exit 2, blocking the sentinel and forcing the agent to fix factories before advancing. - env-factory-generator.md adds a mandatory "Research pass" section requiring autonoma/.factory-plan.md with a per-model table (Model / Audit function / File opened / Import path / DI deps / Decision branch / Notes) before any factory is written. Includes the exact grep patterns the agent itself said would have prevented the silent DI give-up, and a reminder of the external-side-effects policy covering Temporal, BetterAuth hooks, billing, and analytics. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Forces the plugin cache to re-populate so the new validator (hooks/validators/validate_endpoint_implemented.py) and the mandatory research-pass section in agents/env-factory-generator.md ship to every installed instance. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Third-run post-mortem surfaced two new failure modes the prompt alone cannot prevent: 1. Agent creates a standalone Hono server in apps/api/src/autonoma/ and binds its own port instead of mounting the handler on the main app. 2. When the factory-integrity check blocks .endpoint-implemented, the agent games the check by mass-flipping has_creation_code: true -> false in entity-audit.md so the grep has nothing to match. Fixes: - validate_endpoint_implemented.py gains check_handler_mount() that detects standalone serve()/listen()/createServer() in the handler's directory and verifies the handler is imported from outside its own directory. - validate_endpoint_implemented.py gains check_audit_flip() that diffs the current audit against a Step 2 snapshot and fails if more than AUTONOMA_AUDIT_FLIP_CAP models (default 5) flipped from true -> false. - validate-pipeline-output.sh snapshots entity-audit.md to .entity-audit-step2.md when .step-2-ack is written (i.e. the moment the user confirms the audit is accepted). Tested against vienna-v1: all three checks (ORM inlining, standalone server, handler-not-mounted) fire with actionable error messages; the audit-flip check correctly detects 43/54 flips against a synthesized snapshot. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Run 4 bypass (fake extraction): agent moved bare ORM inserts into a helper file under the handler directory and rewrote the entity-audit's creation_file column to point at the helper. Every heuristic hook passed because the audit had been rewritten to match the stub. Three-gate architecture replacing the single heuristic gate: Gate 1 (existing) — cheap syntactic: grep for inline ORM, mount checks, audit-flip cap. Gate 2 (new) — creation_file immutability: for every model with has_creation_code: true in both the Step 2 snapshot and the current audit, refuse any change to creation_file. Extractions must record extracted_to instead of overwriting ground truth. Catches the audit-rewrite attack at near-zero cost. Gate 3 (new) — semantic LLM fan-out: fetches the factory-fidelity rubric from the quarita docs at runtime, spawns one `claude -p` subprocess per has_creation_code: true model (bounded concurrency), compiles per-criterion feedback, and blocks the sentinel on any hard failure. Reads code, not patterns, so factoring evasions do not help. Rubric lives in docs so it can be tuned without cutting a plugin release. Env knobs: AUTONOMA_FIDELITY_CONCURRENCY (6), AUTONOMA_FIDELITY_TIMEOUT (180), AUTONOMA_FIDELITY_MAX_MODELS (60), AUTONOMA_FIDELITY_MODEL (sonnet), AUTONOMA_SKIP_FIDELITY (escape hatch). Gate 3 fails soft on missing docs URL, missing claude CLI, or transient subprocess errors — the cheap gates remain the primary guarantee and the semantic check is an additive safety net. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
When Branch 1 lifts inline creation logic into a new exported function, the developers who later encounter it should be able to tell at a glance that it was extracted for Environment Factory reuse — not invented for it. Ask the agent to leave a 1–2 line comment above the new export explaining the why and pointing at autonoma/entity-audit.md. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Script + fixture set that exercises the live rubric (fetched from
AUTONOMA_DOCS_URL) against known-good and known-bad factory shapes. Each
fixture declares the expected verdict and, for failures, which criteria
should fail. Mismatch is a hard error.
Fixtures (all generic — no codebase-specific names):
- good_uses_service: factory calls <Model>Service.create — should pass
- good_thin_wrapper_after_extraction: Branch 1 extraction preserves every
side effect — should pass
- bad_raw_orm_in_factory: factory body is db.<model>.create — should fail
criteria 1 + 2
- bad_stub_helper_in_handler_dir: helper file in factory dir is a raw
insert with "no business logic" comment — should fail criteria 1, 2, 4
- bad_audit_rewrite_only: thin wrapper keeps side effects, but current
audit's creation_file was repointed at the helper — should fail
criterion 3 only
Run via `AUTONOMA_DOCS_URL=... python3 hooks/validators/evals/run_evals.py`.
Full suite confirmed against the rewritten rubric: 5/5 pass.
Useful when tuning the rubric: change rubric wording → rerun → see which
verdicts flipped. Easier than hunting regressions in live runs.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replaces the binary has_creation_code field with two orthogonal fields so
the audit can describe every way a model comes into existence:
- independently_created: bool — standalone creation path exists
- created_by: [{owner, via, why}] — owners that mint this model inline
Four states: pure root, dual (both true), pure dependent, invalid. Binary
classification collapsed dual models (have their own service AND are minted
inline by a parent transaction), forcing the downstream pipeline to either
fabricate a factory for a dependent or ignore the standalone path.
Changes:
- agents/entity-audit-generator.md: two-pass methodology + 4-state matrix
- agents/scenario-generator.md: per-model standalone-vs-via-owner choice
- agents/env-factory-generator.md: teardown decision tree + compat note
- hooks/validators/_audit_schema.py (new): shared compat shim
- hooks/validators/validate_{factory_fidelity,creation_file_immutable,entity_audit}.py:
filter via is_independently_created; enforce created_by invariants
- hooks/validators/evals/fixtures: dependent_skipped, dual_judged_on_standalone,
bad_missing_owner
- hooks/validators/evals/run_evals.py: handle skip verdict + audit-validator
fixtures (no LLM call needed)
Backwards-compat shim translates legacy has_creation_code on read, so
existing audits keep working. Plugin version 1.12.0.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Vienna-v1 run-through surfaced the env-factory agent oscillating 11+ times
between extract-helper (validator: "helper not provided") and raw-write
(validator: "use the Step 2 creation_function"). Two root causes:
1. find_helper() only detected the FIRST top-level return/await call and
resolved imports via simple suffix guessing — missed wrapper patterns
like `const r = await fn(...); return r.session;` and TS path aliases.
Replaced with find_helpers() that collects every identifier called in
the factory block, filters via the handler's named imports, and resolves
through tsconfig.json compilerOptions.paths.
2. The rubric's Criterion 1 demands the factory "reach the Step 2
creation_function" — impossible when that function is a framework hook
(Better-Auth databaseHooks, NextAuth callbacks, inline route closures)
that only fires via the framework's own entry point. Added an explicit
carve-out: when Step 2 records `needs_extraction: true`, Criterion 1
passes iff the factory calls `extracted_to` and that function preserves
the hook's call chain.
Also relaxed validate_entity_audit.py: factory_count drift is now autofixed
with a stderr warning instead of blocking the pipeline (previously caused
a 6-cycle loop in vienna-v1 where the agent oscillated the count field).
New prompt template placeholders {{NEEDS_EXTRACTION}} / {{EXTRACTED_TO}} and
a new `error` verdict distinguish genuine missing context from true failures.
Three new eval fixtures cover the carve-out:
- framework_hook_extraction_pass.json — expected pass
- framework_hook_raw_write_fail.json — expected fail (Criteria 1 + 4)
- helper_unresolvable_errors.json — expected error (not fail)
Bump plugin to 1.13.0.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…f removing Branch 1 instructions previously told the agent to REMOVE needs_extraction: true after extracting. The v1.13.0 fidelity rubric's framework-hook carve-out reads both needs_extraction and extracted_to to score factories against the helper rather than the un-callable hook, so the field must stay set. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… agents/commands/skills Scenario-validator agent, generate-tests command/skill, and validate-pipeline-output.sh comments still referenced the legacy audit field. Eval fixtures and validator code that intentionally exercise the v1 legacy schema (compat shim) are left untouched. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
snake_to_pascal auto-derives model names without pluralization. The agent was populating tableNameMap as a 1:1 mirror of the factory registry, doubling the maintenance surface. New section 4b gives a 5-step algorithm: only add entries where the factory key disagrees with the auto-derived name, and omit the map entirely when it would be empty. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Reconciles this branch's Scenarios v2.1 + env-factory hardening work with upstream's two additive changes: 1. chiara-ciriani/adhoc-planner (generate-adhoc-tests command) 2. PR Autonoma-AI#28 "SDK integration step & hardened pipeline" Resolution strategy: - Ours wins for every file that belongs to the existing pipeline (env-factory-generator, scenario-generator, scenario-validator, test-case-generator, generate-tests command/skill, validate-pipeline-output.sh, validate_scenarios.py and its test). This branch updated each of them as part of v2.1; upstream's parallel edits are superseded. - Upstream wins for the additive adhoc-tests surface, which is independent of the main pipeline and happens to spawn the existing `scenario-generator` and `env-factory-generator` subagents by name: * agents/focused-test-case-generator.md * commands/generate-adhoc-tests.md * skills/generate-adhoc-tests/SKILL.md - Upstream's env-factory replacement (sdk-integrator.md + validate_sdk_integration.py + test_validate_sdk_integration.py) is dropped. The adhoc command explicitly calls the legacy `env-factory-generator` agent, so reintroducing sdk-integrator here would create two parallel agents for the same role. - tests/test_validate_pipeline_output.py was written against upstream's 135-line hook; our hook is a 318-line superset with lifecycle emission, so the upstream test is dropped rather than kept as a false signal. - plugin.json pinned at 1.13.1 (ours); release-please config from upstream is kept. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
3 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.