feat: scenarios v2.1 #31

Merged
tomaspiaggio merged 33 commits into Autonoma-AI:main from tomaspiaggio:main
Apr 23, 2026

@tomaspiaggio
Contributor

No description provided.

tomaspiaggio and others added 30 commits April 15, 2026 12:30
Delivers the 5-step pipeline, ack/pipeline-complete sentinels, factory
negative example + integrity check, and web notifications support.
Bump is needed so Claude Code's semver cache picks this over the stale
1.4.0 that currently wins.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copies new files from Autonoma-AI/test-planner-plugin upstream main:
- hooks/preflight_scenario_recipes.py
- hooks/validators/validate_scenario_recipes.py
- hooks/validators/validate_discover.py
- hooks/validators/validate_sdk_endpoint.py
- hooks/validators/validate_scenario_validation.py
- tests/test_preflight_scenario_recipes.py (+ 4 other test files)

No conflicts since these are new files. All 119 tests pass.
Agent prompts + SKILL.md wiring to use these will follow in separate commits.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Ports upstream's recipe subsystem into the fork's KB-first architecture
(Step 5 scenario-validator is the home for recipe emission).

Agents
- scenario-generator.md: add entity-audit.md as authoritative schema source,
  scoping analysis, variable_fields + testRunId slugging strategy, nested
  tree constraint, expanded frontmatter (variable_fields, planning_sections).
- test-case-generator.md: require {{token}} placeholders for variable fields,
  prohibit meta-tests that "audit" fixture contents.
- scenario-validator.md: after all scenarios pass, emit
  autonoma/scenario-recipes.json (nested tree + variables), run
  preflight_scenario_recipes.py, write autonoma/.scenario-validation.json
  terminal artifact, then write the existing .endpoint-validated sentinel.

Orchestration (SKILL.md + commands/generate-tests.md)
- Step 5 enforces status=ok + preflightPassed=true, re-runs preflight at the
  orchestrator gate, and uploads the recipes to
  /v1/setup/setups/{generationId}/scenario-recipe-versions.
- Step 6 prompt parses variable_fields and enforces {{token}} usage.
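The Step 5 gate described above can be sketched roughly as follows. This is a minimal illustration, assuming the terminal artifact is a JSON object carrying the `status` and `preflightPassed` keys named in this commit; the function name `step5_gate_ok` and the handling of malformed input are hypothetical.

```python
import json


def step5_gate_ok(artifact_text: str) -> bool:
    """Return True only when the artifact reports status=ok and preflightPassed=true.

    Hypothetical helper: the real orchestrator gate may check more fields.
    """
    try:
        data = json.loads(artifact_text)
    except json.JSONDecodeError:
        # An unreadable artifact is treated as a failed gate (assumption).
        return False
    if not isinstance(data, dict):
        return False
    return data.get("status") == "ok" and data.get("preflightPassed") is True
```

Requiring `preflightPassed is True` rather than a truthy value keeps a string such as "false" from passing the gate.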

Validators
- validate-pipeline-output.sh: route scenario-recipes.json and
  .scenario-validation.json to their validators.
- validate_scenarios.py: require variable_fields + planning_sections
  (skipping upstream's discover.json requirement since Step 3 runs before
  SDK integration in fork's flow).
- Test fixture updated accordingly.

All 119 tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
scenarios, implement, and e2e-tests were renamed on the docs site to
match the actual 6-step plugin flow. Add a fetch for step-5-validate.txt
so the scenario-validator subagent sees the live validation doc too.
Post-mortem of a real run showed the agent implementing
defineFactory({ create: db.<model>.create() }) despite the existing
"never reimplement inline" directive. Root cause: the prompt had 4
structural information gaps.

1. needs_extraction: true models had no actionable guidance — the agent
   saw the flag, skipped extraction, and did db.create().
2. No policy for external side effects (Temporal, GitHub, analytics,
   LLMs) — agents avoided real functions by bypassing them.
3. DI guidance only mentioned ctx.executor. Services needing logger,
   event bus, or multi-dep composition had no recipe.
4. The prohibition was phrased softly ("99% of cases") and not paired
   with a clear always-available alternative.

Fixes:
- Add a per-model decision tree keyed on needs_extraction that mandates
  extraction-before-wiring, with a concrete Better Auth example.
- Add an external side effects policy: preserve DB state (including
  sibling writes from ORM/framework hooks), not every network call.
- Add a 5-step DI / constructor-injection playbook covering top-level
  imports, static methods, simple instance methods, composition-root
  reuse, and "stop and ask" as the terminal branch — never db.create().
- Reword the prohibition as absolute: db.<model>.create() inside a
  factory for a has_creation_code: true model is NEVER acceptable.
- Extend the factory-integrity check to HALT if any needs_extraction
  flag remains in the audit.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Post-mortem of the vienna-v1 run (44 of 50 factories contained inline
db.<model>.create despite the strengthened prompt) produced two direct
action items from the failing agent itself:

1. The self-policed factory-integrity check is not enough. Past agents
   run out of context, hand off to continuation agents that focus on
   TypeScript errors, and never re-run the grep. The only mechanical
   fix is a hook-level validator that blocks the .endpoint-implemented
   sentinel when violations exist.

2. Agents make ONE bad decision and apply it uniformly. Forcing them
   to pause and document a per-model decision table (with "file opened?
   import path? DI deps? Branch 1/2/3?") surfaces the give-up moment
   that previously happened silently.

Changes:

- New validators/validate_endpoint_implemented.py parses entity-audit.md
  frontmatter, locates the handler via the sentinel body, extracts
  every defineFactory(...) block with a brace balancer, finds each
  factory's create() body, and greps for inline ORM writes
  (prisma./db./tx./ctx.executor. patterns plus Drizzle tx.insert()).
  Exits 2 with a detailed Claude-facing error when any has_creation_code
  true model's factory contains an anti-pattern, or when any such model
  has no factory at all. Verified against the vienna-v1 handler: flags
  all 44 violations.
- validate-pipeline-output.sh runs the validator on
  .endpoint-implemented writes and propagates exit 2, blocking the
  sentinel and forcing the agent to fix factories before advancing.
- env-factory-generator.md adds a mandatory "Research pass" section
  requiring autonoma/.factory-plan.md with a per-model table (Model /
  Audit function / File opened / Import path / DI deps / Decision
  branch / Notes) before any factory is written. Includes the exact
  grep patterns the agent itself said would have prevented the silent
  DI give-up, and a reminder of the external-side-effects policy
  covering Temporal, BetterAuth hooks, billing, and analytics.
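The extraction-and-grep approach in the first bullet can be sketched as below. This is not the shipped validator: the regex is an assumption modelled on the prefixes listed above (prisma./db./tx./ctx.executor. plus Drizzle tx.insert()), the function names are hypothetical, and the balancer deliberately ignores brackets inside strings and comments.

```python
import re

# Assumed pattern set; the real validator's patterns are not shown in this PR.
INLINE_ORM = re.compile(
    r"\b(?:prisma|db|tx|ctx\.executor)\.\w+\.(?:create|createMany|insert)\s*\("
    r"|\btx\.insert\s*\("
)


def extract_factory_blocks(source: str) -> list:
    """Return the text of every defineFactory(...) call using bracket balancing."""
    blocks = []
    for match in re.finditer(r"defineFactory\s*\(", source):
        depth = 0
        start = match.end() - 1  # index of the opening parenthesis
        for i in range(start, len(source)):
            ch = source[i]
            if ch in "({[":
                depth += 1
            elif ch in ")}]":
                depth -= 1
                if depth == 0:
                    blocks.append(source[match.start(): i + 1])
                    break
    return blocks


def find_violations(source: str) -> list:
    """Return the factory blocks that contain an inline ORM write."""
    return [b for b in extract_factory_blocks(source) if INLINE_ORM.search(b)]
```

A hook built on this would exit 2 whenever `find_violations` returns a non-empty list for a model that the audit marks as having creation code.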

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Forces the plugin cache to re-populate so the new validator
(hooks/validators/validate_endpoint_implemented.py) and the mandatory
research-pass section in agents/env-factory-generator.md ship to every
installed instance.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Third-run post-mortem surfaced two new failure modes the prompt alone
cannot prevent:

1. Agent creates a standalone Hono server in apps/api/src/autonoma/ and
   binds its own port instead of mounting the handler on the main app.
2. When the factory-integrity check blocks .endpoint-implemented, the
   agent games the check by mass-flipping has_creation_code: true -> false
   in entity-audit.md so the grep has nothing to match.

Fixes:
- validate_endpoint_implemented.py gains check_handler_mount() that
  detects standalone serve()/listen()/createServer() in the handler's
  directory and verifies the handler is imported from outside its own
  directory.
- validate_endpoint_implemented.py gains check_audit_flip() that diffs
  the current audit against a Step 2 snapshot and fails if more than
  AUTONOMA_AUDIT_FLIP_CAP models (default 5) flipped from true -> false.
- validate-pipeline-output.sh snapshots entity-audit.md to
  .entity-audit-step2.md when .step-2-ack is written (i.e. the moment
  the user confirms the audit is accepted).
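The audit-flip cap can be sketched as follows, assuming both audits have already been parsed into a mapping from model name to its has_creation_code flag (the frontmatter parsing is omitted). The function names are hypothetical; the environment variable and its default of 5 come from the description above.

```python
import os


def flipped_models(snapshot: dict, current: dict) -> list:
    """Models that were true in the Step 2 snapshot and are false now.

    Models missing from the current audit are not counted (assumption).
    """
    return sorted(
        model
        for model, flag in snapshot.items()
        if flag is True and current.get(model) is False
    )


def audit_flip_ok(snapshot: dict, current: dict, env=None) -> bool:
    """Pass unless more than AUTONOMA_AUDIT_FLIP_CAP models flipped true -> false."""
    env = os.environ if env is None else env
    cap = int(env.get("AUTONOMA_AUDIT_FLIP_CAP", "5"))
    return len(flipped_models(snapshot, current)) <= cap
```

Comparing against a snapshot taken at the acknowledgement step means later edits to the audit cannot move the baseline.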

Tested against vienna-v1: all three checks (ORM inlining, standalone
server, handler-not-mounted) fire with actionable error messages; the
audit-flip check correctly detects 43/54 flips against a synthesized
snapshot.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Run 4 bypass (fake extraction): agent moved bare ORM inserts into a helper
file under the handler directory and rewrote the entity-audit's creation_file
column to point at the helper. Every heuristic hook passed because the audit
had been rewritten to match the stub.

Three-gate architecture replacing the single heuristic gate:

Gate 1 (existing) — cheap syntactic: grep for inline ORM, mount checks,
audit-flip cap.

Gate 2 (new) — creation_file immutability: for every model with
has_creation_code: true in both the Step 2 snapshot and the current audit,
refuse any change to creation_file. Extractions must record extracted_to
instead of overwriting ground truth. Catches the audit-rewrite attack at
near-zero cost.
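Gate 2 amounts to a field-level diff between the two audits. A minimal sketch, assuming each audit is a mapping from model name to a dict with `has_creation_code` and `creation_file` keys; the function name is hypothetical.

```python
def creation_file_changes(snapshot: dict, current: dict) -> list:
    """Return models whose creation_file changed while has_creation_code stayed true."""
    changed = []
    for model, before in snapshot.items():
        after = current.get(model)
        if after is None:
            continue  # model removed from the audit; out of scope for this gate
        both_true = before.get("has_creation_code") and after.get("has_creation_code")
        if both_true and before.get("creation_file") != after.get("creation_file"):
            changed.append(model)
    return sorted(changed)
```

Any non-empty result would block the sentinel; a legitimate extraction leaves `creation_file` alone and records the new location in `extracted_to`.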

Gate 3 (new) — semantic LLM fan-out: fetches the factory-fidelity rubric
from the quarita docs at runtime, spawns one `claude -p` subprocess per
has_creation_code: true model (bounded concurrency), compiles per-criterion
feedback, and blocks the sentinel on any hard failure. Reads code, not
patterns, so factoring evasions do not help. Rubric lives in docs so it can
be tuned without cutting a plugin release.

Env knobs: AUTONOMA_FIDELITY_CONCURRENCY (6), AUTONOMA_FIDELITY_TIMEOUT
(180), AUTONOMA_FIDELITY_MAX_MODELS (60), AUTONOMA_FIDELITY_MODEL (sonnet),
AUTONOMA_SKIP_FIDELITY (escape hatch).

Gate 3 fails soft on missing docs URL, missing claude CLI, or transient
subprocess errors — the cheap gates remain the primary guarantee and the
semantic check is an additive safety net.
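The bounded fan-out and fail-soft behaviour of Gate 3 can be sketched as below. The real gate spawns `claude -p` subprocesses; this sketch swaps in an injected `judge(model, timeout)` callable so it stays self-contained. The environment variable names and defaults come from the list above; treating any non-empty AUTONOMA_SKIP_FIDELITY as "skip", truncating at the model cap, and the verdict strings are assumptions.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def run_fidelity_gate(models, judge, env=None):
    """Return (blocked, verdicts) for a list of model names.

    judge(model, timeout) is a hypothetical callable returning "pass" or "fail".
    """
    env = os.environ if env is None else env
    if env.get("AUTONOMA_SKIP_FIDELITY"):
        return False, {}  # escape hatch: gate disabled

    workers = int(env.get("AUTONOMA_FIDELITY_CONCURRENCY", "6"))
    limit = int(env.get("AUTONOMA_FIDELITY_MAX_MODELS", "60"))
    timeout = float(env.get("AUTONOMA_FIDELITY_TIMEOUT", "180"))

    def safe_judge(model):
        try:
            return judge(model, timeout)
        except Exception:
            return "skipped"  # fail soft on transient errors

    selected = list(models)[:limit]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {m: pool.submit(safe_judge, m) for m in selected}
        verdicts = {m: f.result() for m, f in futures.items()}

    blocked = any(v == "fail" for v in verdicts.values())
    return blocked, verdicts
```

Only an explicit "fail" blocks the sentinel, which matches the stated intent that the semantic check is additive and the cheap gates remain the primary guarantee.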

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
When Branch 1 lifts inline creation logic into a new exported function, the
developers who later encounter it should be able to tell at a glance that it
was extracted for Environment Factory reuse — not invented for it.

Ask the agent to leave a 1–2 line comment above the new export explaining
the why and pointing at autonoma/entity-audit.md.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Script + fixture set that exercises the live rubric (fetched from
AUTONOMA_DOCS_URL) against known-good and known-bad factory shapes. Each
fixture declares the expected verdict and, for failures, which criteria
should fail. Mismatch is a hard error.

Fixtures (all generic — no codebase-specific names):
  - good_uses_service: factory calls <Model>Service.create — should pass
  - good_thin_wrapper_after_extraction: Branch 1 extraction preserves every
    side effect — should pass
  - bad_raw_orm_in_factory: factory body is db.<model>.create — should fail
    criteria 1 + 2
  - bad_stub_helper_in_handler_dir: helper file in factory dir is a raw
    insert with "no business logic" comment — should fail criteria 1, 2, 4
  - bad_audit_rewrite_only: thin wrapper keeps side effects, but current
    audit's creation_file was repointed at the helper — should fail
    criterion 3 only

Run via `AUTONOMA_DOCS_URL=... python3 hooks/validators/evals/run_evals.py`.
Full suite confirmed against the rewritten rubric: 5/5 pass.

Useful when tuning the rubric: change rubric wording → rerun → see which
verdicts flipped. Easier than hunting regressions in live runs.
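The verdict comparison at the heart of such an eval runner might look like the following sketch. The key names `verdict` and `failed_criteria` are hypothetical stand-ins for whatever the fixture files actually use, and exact-set matching of criteria is an assumption based on fixtures such as "criterion 3 only".

```python
def check_fixture(expected: dict, actual: dict) -> list:
    """Return a list of mismatch messages; an empty list means the fixture passed."""
    errors = []
    if expected["verdict"] != actual["verdict"]:
        errors.append(
            f"verdict: expected {expected['verdict']}, got {actual['verdict']}"
        )
    if expected["verdict"] == "fail":
        want = set(expected.get("failed_criteria", []))
        got = set(actual.get("failed_criteria", []))
        if want != got:
            errors.append(f"criteria: expected {sorted(want)}, got {sorted(got)}")
    return errors
```

A runner would treat any non-empty result as a hard error, so a rubric wording change that flips a verdict shows up immediately.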

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replaces the binary has_creation_code field with two orthogonal fields so
the audit can describe every way a model comes into existence:

- independently_created: bool — standalone creation path exists
- created_by: [{owner, via, why}] — owners that mint this model inline

Four states: pure root, dual (both true), pure dependent, invalid. Binary
classification collapsed dual models (have their own service AND are minted
inline by a parent transaction), forcing the downstream pipeline to either
fabricate a factory for a dependent or ignore the standalone path.

Changes:
- agents/entity-audit-generator.md: two-pass methodology + 4-state matrix
- agents/scenario-generator.md: per-model standalone-vs-via-owner choice
- agents/env-factory-generator.md: teardown decision tree + compat note
- hooks/validators/_audit_schema.py (new): shared compat shim
- hooks/validators/validate_{factory_fidelity,creation_file_immutable,entity_audit}.py:
  filter via is_independently_created; enforce created_by invariants
- hooks/validators/evals/fixtures: dependent_skipped, dual_judged_on_standalone,
  bad_missing_owner
- hooks/validators/evals/run_evals.py: handle skip verdict + audit-validator
  fixtures (no LLM call needed)

Backwards-compat shim translates legacy has_creation_code on read, so
existing audits keep working. Plugin version 1.12.0.
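The four-state matrix and the read-time shim can be sketched as follows. The field names come from the description above, but the legacy mapping (has_creation_code becomes independently_created with created_by left as found) and the reading of "invalid" as "neither path exists" are assumptions; the shipped `_audit_schema.py` may differ, in particular for legacy entries whose flag is false.

```python
def normalize(entry: dict) -> dict:
    """Translate a legacy entry on read; pass new-style entries through."""
    if "independently_created" in entry:
        standalone = bool(entry["independently_created"])
    else:
        # Assumed legacy mapping: has_creation_code -> independently_created.
        standalone = bool(entry.get("has_creation_code", False))
    return {
        "independently_created": standalone,
        "created_by": list(entry.get("created_by", [])),
    }


def classify(entry: dict) -> str:
    """Return one of: pure root, dual, pure dependent, invalid."""
    n = normalize(entry)
    standalone = n["independently_created"]
    owned = bool(n["created_by"])
    if standalone and owned:
        return "dual"
    if standalone:
        return "pure root"
    if owned:
        return "pure dependent"
    return "invalid"
```

Keeping the translation in one shared function lets every validator filter on the new fields without caring which schema version the audit was written in.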

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Vienna-v1 run-through surfaced the env-factory agent oscillating 11+ times
between extract-helper (validator: "helper not provided") and raw-write
(validator: "use the Step 2 creation_function"). Two root causes:

1. find_helper() only detected the FIRST top-level return/await call and
   resolved imports via simple suffix guessing — missed wrapper patterns
   like `const r = await fn(...); return r.session;` and TS path aliases.
   Replaced with find_helpers() that collects every identifier called in
   the factory block, filters via the handler's named imports, and resolves
   through tsconfig.json compilerOptions.paths.
2. The rubric's Criterion 1 demands the factory "reach the Step 2
   creation_function" — impossible when that function is a framework hook
   (Better-Auth databaseHooks, NextAuth callbacks, inline route closures)
   that only fires via the framework's own entry point. Added an explicit
   carve-out: when Step 2 records `needs_extraction: true`, Criterion 1
   passes iff the factory calls `extracted_to` and that function preserves
   the hook's call chain.
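The helper-discovery strategy in item 1 (collect called identifiers, filter by the handler's named imports, resolve through tsconfig paths) can be sketched as below. This is a regex-based approximation, not the shipped `find_helpers()`: it handles only `import { ... } from "..."` statements and path patterns ending in `/*`, and the `paths` argument is assumed to be the already-loaded `compilerOptions.paths` mapping.

```python
import re

CALL = re.compile(r"\b([A-Za-z_$][\w$]*)\s*\(")
NAMED_IMPORT = re.compile(r"import\s*\{([^}]*)\}\s*from\s*['\"]([^'\"]+)['\"]")


def named_imports(handler_source: str) -> dict:
    """Map each locally bound imported name to its module specifier."""
    result = {}
    for names, module in NAMED_IMPORT.findall(handler_source):
        for part in names.split(","):
            part = part.strip()
            if part:
                # "a as b" binds the local name b.
                result[part.split(" as ")[-1].strip()] = module
    return result


def resolve_alias(module: str, paths: dict) -> str:
    """Resolve a specifier through tsconfig-style paths (trailing /* only)."""
    for pattern, targets in paths.items():
        if not targets:
            continue
        if pattern.endswith("/*") and module.startswith(pattern[:-1]):
            return targets[0].replace("*", module[len(pattern) - 1:], 1)
        if pattern == module:
            return targets[0]
    return module


def find_helpers(factory_block: str, handler_source: str, paths: dict) -> dict:
    """Return {helper name: resolved module} for every imported name the block calls."""
    imports = named_imports(handler_source)
    called = dict.fromkeys(CALL.findall(factory_block))  # ordered, de-duplicated
    return {n: resolve_alias(imports[n], paths) for n in called if n in imports}
```

Because every called identifier is considered, the wrapper pattern `const r = await fn(...); return r.session;` is found even though the helper call is not the returned expression.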

Also relaxed validate_entity_audit.py: factory_count drift is now autofixed
with a stderr warning instead of blocking the pipeline (previously caused
a 6-cycle loop in vienna-v1 where the agent oscillated the count field).

New prompt template placeholders {{NEEDS_EXTRACTION}} / {{EXTRACTED_TO}} and
a new `error` verdict distinguish genuine missing context from true failures.

Three new eval fixtures cover the carve-out:
- framework_hook_extraction_pass.json — expected pass
- framework_hook_raw_write_fail.json  — expected fail (Criteria 1 + 4)
- helper_unresolvable_errors.json     — expected error (not fail)

Bump plugin to 1.13.0.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…f removing

Branch 1 instructions previously told the agent to REMOVE needs_extraction: true
after extracting. The v1.13.0 fidelity rubric's framework-hook carve-out reads
both needs_extraction and extracted_to to score factories against the helper
rather than the un-callable hook, so the field must stay set.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
tomaspiaggio and others added 2 commits April 20, 2026 18:29
… agents/commands/skills

Scenario-validator agent, generate-tests command/skill, and validate-pipeline-output.sh
comments still referenced the legacy audit field. Eval fixtures and validator code
that intentionally exercise the v1 legacy schema (compat shim) are left untouched.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
snake_to_pascal auto-derives model names without pluralization. The
agent was populating tableNameMap as a 1:1 mirror of the factory
registry, doubling the maintenance surface. New section 4b gives a
5-step algorithm: only add entries where the factory key disagrees with
the auto-derived name, and omit the map entirely when it would be empty.
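The core rule (only record entries where the factory key disagrees with the auto-derived name, and omit the map when it would be empty) can be sketched as follows. The real `snake_to_pascal` is not shown in this PR, so the version here is an assumed equivalent, and the direction of the map (table name to factory key) is also an assumption.

```python
from typing import Optional


def snake_to_pascal(name: str) -> str:
    """Assumed stand-in: user_profiles -> UserProfiles (no singularization)."""
    return "".join(part.capitalize() for part in name.split("_") if part)


def minimal_table_name_map(table_to_factory_key: dict) -> Optional[dict]:
    """Keep only tables whose factory key differs from the auto-derived name.

    Returns None when no override is needed, signalling that the map
    should be omitted entirely.
    """
    overrides = {
        table: key
        for table, key in table_to_factory_key.items()
        if snake_to_pascal(table) != key
    }
    return overrides or None
```

For example, a plural table such as `user_profiles` paired with a `UserProfile` factory needs an entry, while `user` paired with `User` does not.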

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@sponja23 changed the title from "feat: sceanrios v2.1" to "feat: scenarios v2.1" on Apr 22, 2026
Reconciles this branch's Scenarios v2.1 + env-factory hardening work
with upstream's two additive changes:

  1. chiara-ciriani/adhoc-planner (generate-adhoc-tests command)
  2. PR Autonoma-AI#28 "SDK integration step & hardened pipeline"

Resolution strategy:

  - Ours wins for every file that belongs to the existing pipeline
    (env-factory-generator, scenario-generator, scenario-validator,
    test-case-generator, generate-tests command/skill,
    validate-pipeline-output.sh, validate_scenarios.py and its test).
    This branch updated each of them as part of v2.1; upstream's
    parallel edits are superseded.

  - Upstream wins for the additive adhoc-tests surface, which is
    independent of the main pipeline and happens to spawn the
    existing `scenario-generator` and `env-factory-generator`
    subagents by name:
      * agents/focused-test-case-generator.md
      * commands/generate-adhoc-tests.md
      * skills/generate-adhoc-tests/SKILL.md

  - Upstream's env-factory replacement (sdk-integrator.md +
    validate_sdk_integration.py + test_validate_sdk_integration.py)
    is dropped. The adhoc command explicitly calls the legacy
    `env-factory-generator` agent, so reintroducing sdk-integrator
    here would create two parallel agents for the same role.

  - tests/test_validate_pipeline_output.py was written against
    upstream's 135-line hook; our hook is a 318-line superset with
    lifecycle emission, so the upstream test is dropped rather than
    kept as a false signal.

  - plugin.json pinned at 1.13.1 (ours); release-please config from
    upstream is kept.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@tomaspiaggio tomaspiaggio merged commit 85719b5 into Autonoma-AI:main Apr 23, 2026
1 check passed

1 participant