From 627baf9d6ff958461d25f01cf2426f8be416abb3 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=E7=8F=8A=E7=91=9A=E6=B5=B7?= <你的邮箱>
Date: Wed, 29 Apr 2026 20:24:26 +0800
Subject: [PATCH 1/2] feat: add agent eval failure packet

---
 README.md                            |  12 +-
 docs/AI_CODING_TOOL_FAILURE_NOTES.md |  97 ++++++++++++++
 docs/ANNOTATION_GUIDELINE.md         | 121 ++++++++++++++++++
 examples/README.md                   |  21 +++
 examples/agent_eval_cases.jsonl      |  30 +++++
 examples/verify_agent_eval_cases.py  | 183 +++++++++++++++++++++++++++
 tests/test_agent_eval_cases.py       |  31 +++++
 7 files changed, 494 insertions(+), 1 deletion(-)
 create mode 100644 docs/AI_CODING_TOOL_FAILURE_NOTES.md
 create mode 100644 docs/ANNOTATION_GUIDELINE.md
 create mode 100644 examples/agent_eval_cases.jsonl
 create mode 100644 examples/verify_agent_eval_cases.py
 create mode 100644 tests/test_agent_eval_cases.py

diff --git a/README.md b/README.md
index acc862d..c7cb783 100644
--- a/README.md
+++ b/README.md
@@ -14,7 +14,7 @@
 [![Python](https://img.shields.io/badge/python-3.9%2B-blue.svg)](https://www.python.org/)
 [![LLM overhead](https://img.shields.io/badge/LLM%20overhead-%3C1%25-brightgreen)](#performance)
 
-Latest verified release: [v0.1.1](https://github.com/yangfei222666-9/self-improving-loop/releases/tag/v0.1.1) · Agent data strategy: [AGENT_DATA_STRATEGY.md](AGENT_DATA_STRATEGY.md) · External repro: [EXTERNAL_REPRO.md](EXTERNAL_REPRO.md) · Launch copy: [English + 中文](LAUNCH_COPY_BILINGUAL.md) · Hermes-style guard: [docs/HERMES_SKILL_GUARD.md](docs/HERMES_SKILL_GUARD.md)
+Latest verified release: [v0.1.1](https://github.com/yangfei222666-9/self-improving-loop/releases/tag/v0.1.1) · Agent data strategy: [AGENT_DATA_STRATEGY.md](AGENT_DATA_STRATEGY.md) · Eval labels: [docs/ANNOTATION_GUIDELINE.md](docs/ANNOTATION_GUIDELINE.md) · External repro: [EXTERNAL_REPRO.md](EXTERNAL_REPRO.md) · Hermes-style guard: [docs/HERMES_SKILL_GUARD.md](docs/HERMES_SKILL_GUARD.md)
 
 中文定位：`self-improving-loop` 是 AI Agent 的回归保护层。它包住 LangGraph / Hermes / 自定义 agent 节点，记录 trace，检测成功率或延迟退化，回滚坏配置，并保留可复查事件证据。
 
@@ -155,6 +155,16 @@ python3 examples/regression_rollback_demo.py --data-dir .repro-demo
 python3 examples/verify_regression_rollback_event_trail.py .repro-demo/regression_rollback_event_trail.jsonl
 ```
 
+For the bundled agent-failure eval packet, run:
+
+```bash
+python3 examples/verify_agent_eval_cases.py examples/agent_eval_cases.jsonl
+```
+
+The packet contains 30 non-authorizing cases for silent failure, stale artifacts,
+provider drift, missing event trails, rollback gaps, and unsafe action
+escalation. It is eval data only: no judgment, paper-buy, trade, or promote.
+
 ---
 
 ## Use it as a safety layer for your current agent
diff --git a/docs/AI_CODING_TOOL_FAILURE_NOTES.md b/docs/AI_CODING_TOOL_FAILURE_NOTES.md
new file mode 100644
index 0000000..c20091f
--- /dev/null
+++ b/docs/AI_CODING_TOOL_FAILURE_NOTES.md
@@ -0,0 +1,97 @@
+# AI Coding Tool Failure Notes
+
+This note maps common Claude Code, Cursor, OpenClaw, Hermes, and custom coding
+agent failures into trace fields, failure labels, and guard actions.
+
+The package should be positioned as a reliability layer around these tools, not
+as a replacement for them.
+
+## Core Pattern
+
+```text
+agent action -> trace -> failure label -> verifier -> rollback or block
+```
+
+The important question is not whether the tool produced code. The important
+question is whether the task is complete, tested, and recoverable.
+
+## Common Failure Modes
+
+| Failure mode | Label | Guard action |
+| --- | --- | --- |
+| Patch generated but no matching test ran. | `patch_without_test` | Block promote; require verifier. |
+| Tool call says ok but file/state is unchanged. | `tool_call_noop` | Re-read target state; retry or block. |
+| Multi-turn repair drifts away from the original issue. | `context_drift` | Re-anchor to original task and stop current branch. |
+| Provider changes from intended model to fallback without disclosure. | `provider_route_drift` | Mark blocked until route is verified. |
+| Agent output is truncated or invalid JSON. | `partial_output_truncation` | Retry with bounded prompt or mark degraded. |
+| HTTP 200 response contains empty content. | `http_200_empty_output` | Treat as failure, not ok. |
+| Generated artifact is older than the current run. | `stale_artifact` | Block success claim; require fresh run id/hash. |
+| Review packet points to a different source artifact. | `artifact_hash_mismatch` | Block review; regenerate packet. |
+| Agent repeats the same request hash. | `duplicate_request` | Stop replay and record duplicate evidence. |
+| Regression detected but no rollback event exists. | `rollback_missing` | Block; require restore evidence. |
+| Learning output tries to authorize trade/promote/deploy. | `unsafe_action_escalation` | Hard block. |
+
+## Mapping to self-improving-loop
+
+`self-improving-loop` already provides the runtime seam:
+
+- Trace each callable execution.
+- Track success rate and latency.
+- Trigger a strategy hook when failure patterns cross a threshold.
+- Apply a guarded config patch.
+- Roll back when the patch regresses quality.
+- Preserve an event trail for audit.
+
+The eval packet layer adds the missing data strategy layer:
+
+- Convert traces into failure labels.
+- Separate hard signals from soft reviewer observations.
+- Decide whether the case is `learning_only`, `blocked`, or `manual_review_required`.
+- Prevent eval data from becoming execution authority.
+
+## Tool-Specific Notes
+
+### Claude Code / Cursor
+
+Typical risk: patch quality appears high, but the actual task may be unverified.
+
+Minimum guard:
+
+- Record changed files.
+- Record tests/verifiers run.
+- Record stdout/stderr.
+- Recalculate artifact hashes after patch.
+- Block if no test or verifier maps to the change.
+
+### OpenClaw
+
+Typical risk: a tool route is available in configuration but not actually
+healthy at runtime.
+
+Minimum guard:
+
+- Probe gateway health before trusting tool availability.
+- Record actual route, version, and failure reason.
+- Treat configured-but-unreachable tools as blocked, not degraded.
+
+### Hermes / Skill Runtimes
+
+Typical risk: skill config changes can degrade output while still returning a
+formally successful result.
+
+Minimum guard:
+
+- Store previous skill config.
+- Run a baseline task.
+- Apply the candidate patch.
+- Compare quality and latency.
+- Roll back if worse.
+- Verify event trail contains restore evidence.
+
+## Non-Goals
+
+This note does not claim model training, RL, or autonomous self-improvement.
+Before those claims are valid, the system needs sustained runs, accumulated
+cases, real promote/demote/archive behavior, and at least one learned rule that
+changes future behavior under audit.
+
diff --git a/docs/ANNOTATION_GUIDELINE.md b/docs/ANNOTATION_GUIDELINE.md
new file mode 100644
index 0000000..f5cdb4a
--- /dev/null
+++ b/docs/ANNOTATION_GUIDELINE.md
@@ -0,0 +1,121 @@
+# Agent Eval Annotation Guideline
+
+This guideline defines how to label agent workflow failures for regression
+guards, eval packets, and human review queues.
+
+It is not a benchmark claim. It is a conservative labeling policy for turning
+agent execution evidence into auditable data.
+
+## Goal
+
+Convert an agent run into a structured eval case:
+
+```text
+task -> trace -> artifact -> failure label -> routing verdict -> evidence
+```
+
+The labeler must not infer success from a final text answer alone. A run is only
+trusted when the required trace, artifact, and verifier evidence exist.
+
+## Required Fields
+
+Each eval case should capture:
+
+- `case_id`: stable unique identifier.
+- `domain`: workflow family, such as coding_agent, tool_calling, or provider_route.
+- `task_type`: what the agent was trying to do.
+- `agent_stack`: framework or runner shape, such as LangGraph, Hermes, OpenClaw, Cursor, Claude Code, or custom.
+- `prompt_summary`: short non-sensitive task summary.
+- `observed_failure`: what actually went wrong.
+- `failure_labels`: one or more controlled labels.
+- `signals.hard`: machine-checkable signals.
+- `signals.soft`: reviewer observations that are useful but not sufficient alone.
+- `trace`: provider, model, latency, success signal, artifact status, and event-flow status.
+- `expected_routing`: conservative routing decision.
+- `evidence_required`: artifacts needed before the case can be trusted.
+- `regression_guard_action`: guard action such as block, rollback, retry, or human review.
+
+## Label Set
+
+Use the narrowest labels that are supported by evidence.
+
+| Label | Meaning | Minimum evidence |
+| --- | --- | --- |
+| `silent_failure` | The workflow looks finished, but a required step did not actually complete. | Missing event, missing artifact, or verifier contradiction. |
+| `false_success` | The run reports success while evidence proves the target was not completed. | Success status plus failed verifier or stale/missing artifact. |
+| `stale_artifact` | Output was reused from an older run. | Timestamp, run id, source hash, or archive mismatch. |
+| `missing_event_trail` | The result lacks a parseable event trail. | Missing JSONL/trace or unreadable trace. |
+| `provider_route_drift` | The actual model/provider differs from the intended route. | Expected route and actual route disagree. |
+| `latency_regression` | Latency worsened beyond threshold. | Before/after latency metrics. |
+| `success_rate_regression` | Success rate dropped beyond threshold. | Before/after success metrics. |
+| `tool_call_noop` | A tool call returned ok but did not change the target state. | Tool result plus unchanged artifact/state. |
+| `patch_without_test` | Code was changed without a verifier or test proving behavior. | Diff exists, no matching test/verifier evidence. |
+| `context_drift` | Multi-turn repair no longer addresses the original task. | Original task and later action diverge. |
+| `http_200_empty_output` | Provider returned HTTP 200 but the usable output was empty or unparsable. | HTTP status plus parse failure or empty content. |
+| `rollback_missing` | Regression was detected but no rollback evidence exists. | Regression event without rollback event. |
+| `config_patch_regression` | A config/prompt/tool patch made quality worse. | Before/after quality or rollback trigger. |
+| `duplicate_request` | Same request hash/session was submitted twice and should be blocked. | Duplicate hash count or replay evidence. |
+| `partial_output_truncation` | Output is cut off and cannot support a complete verdict. | Truncation marker, invalid JSON, or missing required section. |
+| `artifact_hash_mismatch` | Review packet does not match the source artifact it claims to review. | Stored hash differs from recalculated hash. |
+| `unsafe_action_escalation` | Learning/review output tries to authorize a risky action. | `trade_allowed`, `paper_buy_allowed`, `promote_allowed`, or equivalent set true. |
+| `human_review_missing` | Human review was required but absent. | Review-required gate plus missing review artifact. |
+
+## Routing Policy
+
+Eval cases are not execution permission.
+
+Allowed routing values:
+
+- `learning_only`: safe to store as training/eval evidence only.
+- `blocked`: cannot be used until missing evidence or hard failure is fixed.
+- `manual_review_required`: human review can inspect it, but execution remains blocked.
+
+Forbidden in eval packets:
+
+- `judgment_allowed=true`
+- `paper_buy_allowed=true`
+- `trade_allowed=true`
+- `promote_allowed=true`
+
+If any forbidden flag appears in an eval packet, the packet verifier must return
+`blocked`.
+
+## Hard vs Soft Signals
+
+Hard signals are machine-checkable and can block automatically:
+
+- Missing artifact.
+- Hash mismatch.
+- Duplicate request hash.
+- Provider/model mismatch.
+- HTTP 200 with empty or invalid JSON.
+- Event trail missing or unparsable.
+- Test/verifier failure.
+
+Soft signals require review and cannot promote alone:
+
+- The answer feels off-topic.
+- The patch looks risky.
+- The wording is ambiguous.
+- The model gave a plausible but unverified explanation.
+
+## Training vs Eval vs Review
+
+Use cases this way:
+
+- Training candidate: stable label, safe content, no secrets, no execution authority.
+- Eval case: deterministic expected route and required evidence are known.
+- Human review: ambiguous, high-impact, or soft-signal-heavy cases.
+- Blocked: missing evidence, unsafe action escalation, or corrupted event flow.
+
+## Stop Rules
+
+Stop and mark `blocked` when:
+
+- The event trail is missing or unparsable.
+- The source artifact is missing.
+- A learning packet contains any execution authorization.
+- The provider route cannot be verified.
+- The case contains secrets or personal data.
+- The case relies on a stale output as fresh evidence.
+
diff --git a/examples/README.md b/examples/README.md
index 8240256..4e42af7 100644
--- a/examples/README.md
+++ b/examples/README.md
@@ -9,6 +9,7 @@ python examples/03_langgraph_adapter.py
 python examples/04_yijing_strategy.py
 python examples/05_langgraph_regression_guard.py
 python examples/06_hermes_skill_regression_guard.py
+python examples/verify_agent_eval_cases.py examples/agent_eval_cases.jsonl
 ```
 
 ## 01_basic_tracking.py
@@ -62,3 +63,23 @@ Proves the Hermes-style skill seam without a Hermes dependency:
 
 Use this when someone asks how the package fits under Hermes, OpenClaw, or any
 skill-based agent runtime instead of competing with it.
+
+## agent_eval_cases.jsonl
+
+Provides 30 non-authorizing eval cases for coding-agent, tool-calling,
+provider-route, stale-artifact, rollback, and governance failures.
+
+Verify the packet:
+
+```bash
+python examples/verify_agent_eval_cases.py examples/agent_eval_cases.jsonl
+```
+
+Expected boundary:
+
+```text
+judgment_allowed=false
+paper_buy_allowed=false
+trade_allowed=false
+promote_allowed=false
+```
diff --git a/examples/agent_eval_cases.jsonl b/examples/agent_eval_cases.jsonl
new file mode 100644
index 0000000..c6ca65e
--- /dev/null
+++ b/examples/agent_eval_cases.jsonl
@@ -0,0 +1,30 @@
+{"case_id":"AF-001","domain":"coding_agent","task_type":"patch_verification","agent_stack":"Claude Code","prompt_summary":"Fix a failing parser test","observed_failure":"Patch was generated but no test mapped to the parser behavior was run.","failure_labels":["patch_without_test"],"signals":{"hard":["diff_present","missing_test_evidence"],"soft":["patch_plausible"]},"trace":{"event_flow_present":true,"artifact_present":true,"provider_route":"claude_code","model":"unknown","latency_ms":18400,"success_signal":"tool_reported_ok"},"expected_routing":{"verdict":"manual_review_required","judgment_allowed":false,"paper_buy_allowed":false,"trade_allowed":false,"promote_allowed":false,"human_review_required":true},"evidence_required":["diff","test_command","test_output","artifact_hash"],"regression_guard_action":["block_promote","request_targeted_test"]}
+{"case_id":"AF-002","domain":"coding_agent","task_type":"tool_call","agent_stack":"Cursor","prompt_summary":"Rename a config key across the repo","observed_failure":"Tool call returned ok, but repository search still found the old config key.","failure_labels":["tool_call_noop","false_success"],"signals":{"hard":["tool_status_ok","state_unchanged"],"soft":[]},"trace":{"event_flow_present":true,"artifact_present":true,"provider_route":"cursor_agent","model":"unknown","latency_ms":9200,"success_signal":"tool_reported_ok"},"expected_routing":{"verdict":"blocked","judgment_allowed":false,"paper_buy_allowed":false,"trade_allowed":false,"promote_allowed":false,"human_review_required":false},"evidence_required":["post_action_search","changed_file_list","tool_result"],"regression_guard_action":["block_success_claim","rerun_state_check"]}
+{"case_id":"AF-003","domain":"provider_route","task_type":"model_selection","agent_stack":"custom","prompt_summary":"Run a high-trust review with a required model","observed_failure":"Actual provider route drifted to a fallback model while the summary still claimed the required model.","failure_labels":["provider_route_drift","false_success"],"signals":{"hard":["expected_model_mismatch","actual_route_recorded"],"soft":[]},"trace":{"event_flow_present":true,"artifact_present":true,"provider_route":"fallback_provider","model":"fallback-small","latency_ms":3500,"success_signal":"review_text_present"},"expected_routing":{"verdict":"blocked","judgment_allowed":false,"paper_buy_allowed":false,"trade_allowed":false,"promote_allowed":false,"human_review_required":false},"evidence_required":["expected_route","actual_route","provider_probe"],"regression_guard_action":["block_judgment","require_live_probe"]}
+{"case_id":"AF-004","domain":"artifact_freshness","task_type":"daily_digest","agent_stack":"cron","prompt_summary":"Generate the latest daily digest","observed_failure":"Digest file existed but carried an older run id than the current workflow.","failure_labels":["stale_artifact","false_success"],"signals":{"hard":["mtime_old","run_id_mismatch"],"soft":[]},"trace":{"event_flow_present":true,"artifact_present":true,"provider_route":"local","model":"none","latency_ms":0,"success_signal":"file_exists"},"expected_routing":{"verdict":"blocked","judgment_allowed":false,"paper_buy_allowed":false,"trade_allowed":false,"promote_allowed":false,"human_review_required":false},"evidence_required":["artifact_mtime","artifact_run_id","workflow_run_id"],"regression_guard_action":["clear_stale_artifact","block_latest_alias"]}
+{"case_id":"AF-005","domain":"event_flow","task_type":"agent_review","agent_stack":"custom","prompt_summary":"Review an agent output for promotion","observed_failure":"Review text was present, but no event trail could be parsed.","failure_labels":["missing_event_trail"],"signals":{"hard":["event_log_missing"],"soft":["review_looks_complete"]},"trace":{"event_flow_present":false,"artifact_present":true,"provider_route":"unknown","model":"unknown","latency_ms":0,"success_signal":"review_markdown_exists"},"expected_routing":{"verdict":"blocked","judgment_allowed":false,"paper_buy_allowed":false,"trade_allowed":false,"promote_allowed":false,"human_review_required":false},"evidence_required":["event_flow_jsonl","review_artifact"],"regression_guard_action":["block_completion_claim","require_event_flow"]}
+{"case_id":"AF-006","domain":"provider_output","task_type":"structured_review","agent_stack":"DeepSeek API","prompt_summary":"Return JSON verdict for a learning case","observed_failure":"HTTP status was 200, but the content was empty and JSON parsing failed.","failure_labels":["http_200_empty_output","partial_output_truncation"],"signals":{"hard":["http_status_200","empty_content","json_parse_failed"],"soft":[]},"trace":{"event_flow_present":true,"artifact_present":false,"provider_route":"deepseek_official","model":"deepseek-chat","latency_ms":11800,"success_signal":"http_200"},"expected_routing":{"verdict":"learning_only","judgment_allowed":false,"paper_buy_allowed":false,"trade_allowed":false,"promote_allowed":false,"human_review_required":true},"evidence_required":["raw_response_status","raw_response_body","parse_error"],"regression_guard_action":["retry_once","record_degraded_attempt"]}
+{"case_id":"AF-007","domain":"rollback","task_type":"config_patch","agent_stack":"Hermes","prompt_summary":"Apply a skill config improvement","observed_failure":"Regression was detected after the patch, but no rollback restore event was recorded.","failure_labels":["rollback_missing","config_patch_regression"],"signals":{"hard":["quality_worse","missing_rollback_event"],"soft":[]},"trace":{"event_flow_present":true,"artifact_present":true,"provider_route":"local_skill","model":"none","latency_ms":2100,"success_signal":"patch_applied"},"expected_routing":{"verdict":"blocked","judgment_allowed":false,"paper_buy_allowed":false,"trade_allowed":false,"promote_allowed":false,"human_review_required":false},"evidence_required":["baseline_metrics","patched_metrics","rollback_event"],"regression_guard_action":["block_patch","restore_previous_config"]}
+{"case_id":"AF-008","domain":"request_dedup","task_type":"review_smoke","agent_stack":"custom","prompt_summary":"Run the same review smoke twice","observed_failure":"The same request hash appeared twice in one session.","failure_labels":["duplicate_request"],"signals":{"hard":["duplicate_hash_count_gt_1"],"soft":[]},"trace":{"event_flow_present":true,"artifact_present":true,"provider_route":"local","model":"none","latency_ms":0,"success_signal":"second_request_received"},"expected_routing":{"verdict":"blocked","judgment_allowed":false,"paper_buy_allowed":false,"trade_allowed":false,"promote_allowed":false,"human_review_required":false},"evidence_required":["session_id","request_hash","request_count"],"regression_guard_action":["drop_duplicate","record_duplicate_evidence"]}
+{"case_id":"AF-009","domain":"coding_agent","task_type":"multi_turn_repair","agent_stack":"Cursor","prompt_summary":"Fix CI after a lint failure","observed_failure":"Later turns started optimizing unrelated README wording instead of fixing the CI failure.","failure_labels":["context_drift"],"signals":{"hard":["original_failure_unresolved"],"soft":["later_patch_off_topic"]},"trace":{"event_flow_present":true,"artifact_present":true,"provider_route":"cursor_agent","model":"unknown","latency_ms":27000,"success_signal":"patch_created"},"expected_routing":{"verdict":"manual_review_required","judgment_allowed":false,"paper_buy_allowed":false,"trade_allowed":false,"promote_allowed":false,"human_review_required":true},"evidence_required":["original_failure","final_diff","test_output"],"regression_guard_action":["stop_branch","reanchor_to_original_task"]}
+{"case_id":"AF-010","domain":"artifact_integrity","task_type":"review_packet","agent_stack":"custom","prompt_summary":"Verify a manual review packet","observed_failure":"The review packet source hash did not match the current action sheet hash.","failure_labels":["artifact_hash_mismatch","stale_artifact"],"signals":{"hard":["hash_mismatch"],"soft":[]},"trace":{"event_flow_present":true,"artifact_present":true,"provider_route":"local","model":"none","latency_ms":0,"success_signal":"review_packet_exists"},"expected_routing":{"verdict":"blocked","judgment_allowed":false,"paper_buy_allowed":false,"trade_allowed":false,"promote_allowed":false,"human_review_required":false},"evidence_required":["stored_hash","recalculated_hash","source_path"],"regression_guard_action":["regenerate_review_packet","block_review"]}
+{"case_id":"AF-011","domain":"safe_action","task_type":"learning_packet","agent_stack":"custom","prompt_summary":"Verify a learning-only capsule","observed_failure":"The capsule output included trade_allowed=true.","failure_labels":["unsafe_action_escalation"],"signals":{"hard":["trade_allowed_true"],"soft":[]},"trace":{"event_flow_present":true,"artifact_present":true,"provider_route":"local","model":"none","latency_ms":0,"success_signal":"capsule_output_present"},"expected_routing":{"verdict":"blocked","judgment_allowed":false,"paper_buy_allowed":false,"trade_allowed":false,"promote_allowed":false,"human_review_required":false},"evidence_required":["capsule_output","event_flow"],"regression_guard_action":["hard_block","strip_execution_authority"]}
+{"case_id":"AF-012","domain":"human_review","task_type":"paper_candidate","agent_stack":"custom","prompt_summary":"Route a candidate with consecutive appearances","observed_failure":"The system required manual review, but no review artifact existed.","failure_labels":["human_review_missing"],"signals":{"hard":["manual_review_required","review_artifact_missing"],"soft":[]},"trace":{"event_flow_present":true,"artifact_present":false,"provider_route":"local","model":"none","latency_ms":0,"success_signal":"candidate_present"},"expected_routing":{"verdict":"manual_review_required","judgment_allowed":false,"paper_buy_allowed":false,"trade_allowed":false,"promote_allowed":false,"human_review_required":true},"evidence_required":["manual_review_artifact","review_event_flow"],"regression_guard_action":["block_execution","generate_review_packet"]}
+{"case_id":"AF-013","domain":"latency","task_type":"agent_node","agent_stack":"LangGraph","prompt_summary":"Run a planning node after prompt patch","observed_failure":"Latency increased by 35 percent after the patch while success stayed flat.","failure_labels":["latency_regression","config_patch_regression"],"signals":{"hard":["latency_gain_gt_threshold"],"soft":[]},"trace":{"event_flow_present":true,"artifact_present":true,"provider_route":"langgraph_node","model":"none","latency_ms":1350,"success_signal":"node_success"},"expected_routing":{"verdict":"learning_only","judgment_allowed":false,"paper_buy_allowed":false,"trade_allowed":false,"promote_allowed":false,"human_review_required":true},"evidence_required":["before_latency","after_latency","patch_id"],"regression_guard_action":["rollback_if_quality_not_improved","record_regression"]}
+{"case_id":"AF-014","domain":"success_rate","task_type":"agent_node","agent_stack":"LangGraph","prompt_summary":"Evaluate a routing prompt patch","observed_failure":"Success rate dropped from 92 percent to 78 percent after the patch.","failure_labels":["success_rate_regression","config_patch_regression"],"signals":{"hard":["success_rate_drop_gt_threshold"],"soft":[]},"trace":{"event_flow_present":true,"artifact_present":true,"provider_route":"langgraph_node","model":"none","latency_ms":900,"success_signal":"mixed_results"},"expected_routing":{"verdict":"blocked","judgment_allowed":false,"paper_buy_allowed":false,"trade_allowed":false,"promote_allowed":false,"human_review_required":false},"evidence_required":["before_success_rate","after_success_rate","rollback_event"],"regression_guard_action":["rollback_patch","block_promote"]}
+{"case_id":"AF-015","domain":"tool_calling","task_type":"browser_action","agent_stack":"OpenClaw","prompt_summary":"Download a media file through a browser tool","observed_failure":"The tool reported completion, but the expected downloaded file was absent.","failure_labels":["tool_call_noop","false_success"],"signals":{"hard":["tool_status_ok","file_missing"],"soft":[]},"trace":{"event_flow_present":true,"artifact_present":false,"provider_route":"openclaw_browser","model":"none","latency_ms":15500,"success_signal":"tool_reported_ok"},"expected_routing":{"verdict":"blocked","judgment_allowed":false,"paper_buy_allowed":false,"trade_allowed":false,"promote_allowed":false,"human_review_required":false},"evidence_required":["download_path","file_exists_check","tool_log"],"regression_guard_action":["block_completion_claim","retry_with_file_check"]}
+{"case_id":"AF-016","domain":"structured_output","task_type":"json_report","agent_stack":"Claude Code","prompt_summary":"Produce a structured review JSON","observed_failure":"The output stopped mid-object and could not be parsed.","failure_labels":["partial_output_truncation"],"signals":{"hard":["invalid_json","truncation_detected"],"soft":[]},"trace":{"event_flow_present":true,"artifact_present":false,"provider_route":"claude_code","model":"unknown","latency_ms":22300,"success_signal":"partial_text_present"},"expected_routing":{"verdict":"learning_only","judgment_allowed":false,"paper_buy_allowed":false,"trade_allowed":false,"promote_allowed":false,"human_review_required":true},"evidence_required":["raw_output","json_parse_error"],"regression_guard_action":["retry_with_smaller_schema","record_degraded_output"]}
+{"case_id":"AF-017","domain":"sync","task_type":"cross_machine_handoff","agent_stack":"custom","prompt_summary":"Consume a Win11 handoff from Mac","observed_failure":"The message claimed a patch existed, but no synced artifact was readable on Mac.","failure_labels":["silent_failure","stale_artifact"],"signals":{"hard":["artifact_missing_on_receiver"],"soft":["handoff_claim_present"]},"trace":{"event_flow_present":true,"artifact_present":false,"provider_route":"local_sync","model":"none","latency_ms":0,"success_signal":"chat_claim_present"},"expected_routing":{"verdict":"blocked","judgment_allowed":false,"paper_buy_allowed":false,"trade_allowed":false,"promote_allowed":false,"human_review_required":false},"evidence_required":["handoff_path","artifact_exists_check","sync_timestamp"],"regression_guard_action":["mark_unsynced","request_artifact_path"]}
+{"case_id":"AF-018","domain":"ci","task_type":"github_actions","agent_stack":"GitHub Actions","prompt_summary":"Confirm a pull request is healthy","observed_failure":"Local tests passed but remote CI failed on Windows.","failure_labels":["false_success","patch_without_test"],"signals":{"hard":["local_pass","remote_ci_failed"],"soft":[]},"trace":{"event_flow_present":true,"artifact_present":true,"provider_route":"github_actions","model":"none","latency_ms":0,"success_signal":"local_tests_passed"},"expected_routing":{"verdict":"manual_review_required","judgment_allowed":false,"paper_buy_allowed":false,"trade_allowed":false,"promote_allowed":false,"human_review_required":true},"evidence_required":["local_test_output","remote_ci_url","failed_job_log"],"regression_guard_action":["block_merge","inspect_remote_failure"]}
+{"case_id":"AF-019","domain":"memory","task_type":"context_reuse","agent_stack":"custom","prompt_summary":"Reuse prior run memory for a current status report","observed_failure":"The answer presented old memory as current state without verifying current artifacts.","failure_labels":["stale_artifact","false_success"],"signals":{"hard":["memory_timestamp_old","no_current_artifact_check"],"soft":["answer_confident"]},"trace":{"event_flow_present":true,"artifact_present":false,"provider_route":"local_memory","model":"none","latency_ms":0,"success_signal":"memory_summary_present"},"expected_routing":{"verdict":"learning_only","judgment_allowed":false,"paper_buy_allowed":false,"trade_allowed":false,"promote_allowed":false,"human_review_required":true},"evidence_required":["memory_timestamp","current_artifact_probe"],"regression_guard_action":["label_memory_derived","refresh_artifacts"]}
+{"case_id":"AF-020","domain":"provider_auth","task_type":"relay_probe","agent_stack":"custom","prompt_summary":"Use a secondary model relay for review","observed_failure":"Relay returned 401 invalid key but the downstream summary still treated it as available.","failure_labels":["provider_route_drift","false_success"],"signals":{"hard":["auth_401","provider_marked_available"],"soft":[]},"trace":{"event_flow_present":true,"artifact_present":true,"provider_route":"claude_relay","model":"claude-opus","latency_ms":1200,"success_signal":"provider_listed"},"expected_routing":{"verdict":"blocked","judgment_allowed":false,"paper_buy_allowed":false,"trade_allowed":false,"promote_allowed":false,"human_review_required":false},"evidence_required":["probe_status","provider_id","failure_reason"],"regression_guard_action":["mark_provider_blocked","exclude_from_review"]}
+{"case_id":"AF-021","domain":"quota","task_type":"model_probe","agent_stack":"custom","prompt_summary":"Check required model before a workflow","observed_failure":"Models endpoint succeeded, but chat probe failed because accounts were exhausted.","failure_labels":["silent_failure","provider_route_drift"],"signals":{"hard":["models_ok","chat_probe_failed"],"soft":[]},"trace":{"event_flow_present":true,"artifact_present":true,"provider_route":"openai_relay","model":"deepseek-v3.2","latency_ms":2400,"success_signal":"models_endpoint_ok"},"expected_routing":{"verdict":"blocked","judgment_allowed":false,"paper_buy_allowed":false,"trade_allowed":false,"promote_allowed":false,"human_review_required":false},"evidence_required":["models_probe","chat_probe","quota_error"],"regression_guard_action":["block_workflow","repair_account_pool"]}
+{"case_id":"AF-022","domain":"runtime","task_type":"long_running_task","agent_stack":"custom","prompt_summary":"Run a digest builder with sparse stdout","observed_failure":"The process looked frozen for more than a minute because no progress events were emitted during a long step.","failure_labels":["silent_failure","missing_event_trail"],"signals":{"hard":["stdout_silence_gt_threshold","missing_progress_events"],"soft":["process_still_running"]},"trace":{"event_flow_present":true,"artifact_present":true,"provider_route":"local","model":"none","latency_ms":110000,"success_signal":"final_status_ok"},"expected_routing":{"verdict":"learning_only","judgment_allowed":false,"paper_buy_allowed":false,"trade_allowed":false,"promote_allowed":false,"human_review_required":true},"evidence_required":["start_event","progress_events","done_event"],"regression_guard_action":["add_progress_events","do_not_mark_freeze_if_done_event_exists"]}
+{"case_id":"AF-023","domain":"external_repro","task_type":"user_feedback","agent_stack":"custom","prompt_summary":"Ask an external user to run the rollback demo","observed_failure":"The repo got a star but no repro log, issue, or event trail was provided.","failure_labels":["false_success"],"signals":{"hard":["no_repro_artifact"],"soft":["social_signal_present"]},"trace":{"event_flow_present":true,"artifact_present":false,"provider_route":"github","model":"none","latency_ms":0,"success_signal":"star_or_like"},"expected_routing":{"verdict":"learning_only","judgment_allowed":false,"paper_buy_allowed":false,"trade_allowed":false,"promote_allowed":false,"human_review_required":true},"evidence_required":["external_issue","command_output","event_trail"],"regression_guard_action":["do_not_count_as_repro","request_concrete_log"]}
+{"case_id":"AF-024","domain":"release","task_type":"package_entrypoint","agent_stack":"pip","prompt_summary":"Validate package installation from a release wheel","observed_failure":"Package installed, but the CLI entrypoint failed on import.","failure_labels":["false_success","patch_without_test"],"signals":{"hard":["pip_install_ok","cli_import_failed"],"soft":[]},"trace":{"event_flow_present":true,"artifact_present":true,"provider_route":"pip","model":"none","latency_ms":0,"success_signal":"install_ok"},"expected_routing":{"verdict":"blocked","judgment_allowed":false,"paper_buy_allowed":false,"trade_allowed":false,"promote_allowed":false,"human_review_required":false},"evidence_required":["install_log","cli_output","import_traceback"],"regression_guard_action":["block_release_claim","add_cli_smoke"]}
+{"case_id":"AF-025","domain":"data_governance","task_type":"experience_pool","agent_stack":"custom","prompt_summary":"Accept an anonymous experience record","observed_failure":"The record included a raw user identifier in a free-text field.","failure_labels":["unsafe_action_escalation"],"signals":{"hard":["forbidden_field_present","privacy_violation"],"soft":[]},"trace":{"event_flow_present":true,"artifact_present":true,"provider_route":"local_validator","model":"none","latency_ms":0,"success_signal":"json_record_present"},"expected_routing":{"verdict":"blocked","judgment_allowed":false,"paper_buy_allowed":false,"trade_allowed":false,"promote_allowed":false,"human_review_required":false},"evidence_required":["record_json","validator_output"],"regression_guard_action":["reject_record","redact_or_remove_field"]}
+{"case_id":"AF-026","domain":"prompt_reliability","task_type":"system_prompt_update","agent_stack":"custom","prompt_summary":"Update a system prompt for stricter failures","observed_failure":"Prompt wording changed but no before/after failure case was rerun.","failure_labels":["patch_without_test"],"signals":{"hard":["prompt_diff_present","missing_eval_rerun"],"soft":[]},"trace":{"event_flow_present":true,"artifact_present":true,"provider_route":"local","model":"none","latency_ms":0,"success_signal":"prompt_file_changed"},"expected_routing":{"verdict":"manual_review_required","judgment_allowed":false,"paper_buy_allowed":false,"trade_allowed":false,"promote_allowed":false,"human_review_required":true},"evidence_required":["prompt_diff","before_eval","after_eval"],"regression_guard_action":["block_claim","run_eval_packet"]}
+{"case_id":"AF-027","domain":"scheduler","task_type":"cron_run","agent_stack":"cron","prompt_summary":"Run a scheduled reliability audit","observed_failure":"The scheduler fired, but the expected output directory had no new files.","failure_labels":["silent_failure","tool_call_noop"],"signals":{"hard":["cron_triggered","no_new_artifact"],"soft":[]},"trace":{"event_flow_present":true,"artifact_present":false,"provider_route":"local_cron","model":"none","latency_ms":0,"success_signal":"scheduler_started"},"expected_routing":{"verdict":"blocked","judgment_allowed":false,"paper_buy_allowed":false,"trade_allowed":false,"promote_allowed":false,"human_review_required":false},"evidence_required":["scheduler_log","output_dir_mtime","expected_artifact_list"],"regression_guard_action":["mark_run_failed","inspect_scheduler_log"]}
+{"case_id":"AF-028","domain":"artifact_manifest","task_type":"archive","agent_stack":"custom","prompt_summary":"Archive a successful workflow run","observed_failure":"The success pointer moved, but one required artifact was absent from the archive.","failure_labels":["silent_failure","stale_artifact"],"signals":{"hard":["pointer_updated","archive_missing_required_file"],"soft":[]},"trace":{"event_flow_present":true,"artifact_present":true,"provider_route":"local","model":"none","latency_ms":0,"success_signal":"pointer_updated"},"expected_routing":{"verdict":"blocked","judgment_allowed":false,"paper_buy_allowed":false,"trade_allowed":false,"promote_allowed":false,"human_review_required":false},"evidence_required":["archive_manifest","required_file_list","pointer_file"],"regression_guard_action":["invalidate_pointer","rebuild_archive"]}
+{"case_id":"AF-029","domain":"multi_agent","task_type":"handoff","agent_stack":"CrewAI","prompt_summary":"Delegate a code edit to a worker agent","observed_failure":"Worker edited overlapping files and reverted another worker's changes.","failure_labels":["context_drift","patch_without_test"],"signals":{"hard":["overlapping_write_set","unexpected_revert"],"soft":["handoff_scope_ambiguous"]},"trace":{"event_flow_present":true,"artifact_present":true,"provider_route":"crewai_worker","model":"unknown","latency_ms":42000,"success_signal":"worker_completed"},"expected_routing":{"verdict":"manual_review_required","judgment_allowed":false,"paper_buy_allowed":false,"trade_allowed":false,"promote_allowed":false,"human_review_required":true},"evidence_required":["worker_assignment","diff","test_output"],"regression_guard_action":["block_merge","split_write_scope"]}
+{"case_id":"AF-030","domain":"security","task_type":"secret_scan","agent_stack":"custom","prompt_summary":"Publish a repository snapshot","observed_failure":"Local scan found an untracked secret-like file outside git, and the publish flow did not record whether it was excluded.","failure_labels":["silent_failure","unsafe_action_escalation"],"signals":{"hard":["secret_like_file_present","publish_exclusion_unverified"],"soft":[]},"trace":{"event_flow_present":true,"artifact_present":true,"provider_route":"local_secret_scan","model":"none","latency_ms":0,"success_signal":"git_status_clean"},"expected_routing":{"verdict":"blocked","judgment_allowed":false,"paper_buy_allowed":false,"trade_allowed":false,"promote_allowed":false,"human_review_required":false},"evidence_required":["secret_scan_report","git_status","ignore_rule"],"regression_guard_action":["block_publish","quarantine_or_verify_exclusion"]}
diff --git a/examples/verify_agent_eval_cases.py b/examples/verify_agent_eval_cases.py
new file mode 100644
index 0000000..e0fb129
--- /dev/null
+++ b/examples/verify_agent_eval_cases.py
@@ -0,0 +1,183 @@
+#!/usr/bin/env python3
+"""Verify the bundled agent eval case JSONL packet.
+
+The packet is intentionally conservative: it may describe failures that require
+review, but it must never authorize judgment, paper-buy, trade, or promote.
+"""
+
+from __future__ import annotations
+
+import json
+import sys
+from pathlib import Path
+from typing import Any
+
+
+ALLOWED_LABELS = {
+    "artifact_hash_mismatch",
+    "config_patch_regression",
+    "context_drift",
+    "duplicate_request",
+    "false_success",
+    "http_200_empty_output",
+    "human_review_missing",
+    "latency_regression",
+    "missing_event_trail",
+    "partial_output_truncation",
+    "patch_without_test",
+    "provider_route_drift",
+    "rollback_missing",
+    "silent_failure",
+    "stale_artifact",
+    "success_rate_regression",
+    "tool_call_noop",
+    "unsafe_action_escalation",
+}
+
+ALLOWED_VERDICTS = {"blocked", "learning_only", "manual_review_required"}
+REQUIRED_TOP_LEVEL = {
+    "agent_stack",
+    "case_id",
+    "domain",
+    "evidence_required",
+    "expected_routing",
+    "failure_labels",
+    "observed_failure",
+    "prompt_summary",
+    "regression_guard_action",
+    "signals",
+    "task_type",
+    "trace",
+}
+FORBIDDEN_TRUE_FLAGS = {
+    "judgment_allowed",
+    "paper_buy_allowed",
+    "promote_allowed",
+    "trade_allowed",
+}
+
+
+def load_jsonl(path: Path) -> tuple[list[dict[str, Any]], list[str]]:
+    records: list[dict[str, Any]] = []
+    failures: list[str] = []
+    for line_no, line in enumerate(path.read_text(encoding="utf-8").splitlines(), 1):
+        if not line.strip():
+            continue
+        try:
+            record = json.loads(line)
+        except json.JSONDecodeError as exc:
+            failures.append(f"line_{line_no}:invalid_json:{exc.msg}")
+            continue
+        if not isinstance(record, dict):
+            failures.append(f"line_{line_no}:not_object")
+            continue
+        records.append(record)
+    return records, failures
+
+
+def verify_record(record: dict[str, Any], index: int) -> list[str]:
+    case_id = str(record.get("case_id") or f"line_{index}")
+    failures: list[str] = []
+
+    missing = sorted(REQUIRED_TOP_LEVEL - set(record))
+    if missing:
+        failures.append(f"{case_id}:missing_fields={missing}")
+
+    labels = record.get("failure_labels")
+    if not isinstance(labels, list) or not labels:
+        failures.append(f"{case_id}:failure_labels_missing_or_empty")
+    else:
+        unknown = sorted(label for label in labels if label not in ALLOWED_LABELS)
+        if unknown:
+            failures.append(f"{case_id}:unknown_labels={unknown}")
+
+    signals = record.get("signals")
+    if not isinstance(signals, dict):
+        failures.append(f"{case_id}:signals_not_object")
+    else:
+        if not isinstance(signals.get("hard"), list):
+            failures.append(f"{case_id}:signals_hard_not_list")
+        if not isinstance(signals.get("soft"), list):
+            failures.append(f"{case_id}:signals_soft_not_list")
+
+    trace = record.get("trace")
+    if not isinstance(trace, dict):
+        failures.append(f"{case_id}:trace_not_object")
+    else:
+        if not isinstance(trace.get("event_flow_present"), bool):
+            failures.append(f"{case_id}:trace_event_flow_present_not_bool")
+        if not isinstance(trace.get("artifact_present"), bool):
+            failures.append(f"{case_id}:trace_artifact_present_not_bool")
+
+    routing = record.get("expected_routing")
+    if not isinstance(routing, dict):
+        failures.append(f"{case_id}:expected_routing_not_object")
+    else:
+        verdict = routing.get("verdict")
+        if verdict not in ALLOWED_VERDICTS:
+            failures.append(f"{case_id}:unsupported_verdict={verdict}")
+        for flag in FORBIDDEN_TRUE_FLAGS:
+            if routing.get(flag) is not False:
+                failures.append(f"{case_id}:{flag}_not_false")
+        if not isinstance(routing.get("human_review_required"), bool):
+            failures.append(f"{case_id}:human_review_required_not_bool")
+
+    for field in ("evidence_required", "regression_guard_action"):
+        value = record.get(field)
+        if not isinstance(value, list) or not value:
+            failures.append(f"{case_id}:{field}_missing_or_empty")
+
+    return failures
+
+
+def main(argv: list[str]) -> int:
+    if len(argv) > 2:
+        print("usage: verify_agent_eval_cases.py [agent_eval_cases.jsonl]")
+        return 2
+
+    path = (
+        Path(argv[1]).expanduser()
+        if len(argv) == 2
+        else Path(__file__).with_name("agent_eval_cases.jsonl")
+    )
+    path = path.resolve()
+    if not path.exists():
+        print(f"verdict=blocked")
+        print(f"missing_eval_cases={path}")
+        return 1
+
+    records, failures = load_jsonl(path)
+    if len(records) < 30:
+        failures.append(f"case_count_lt_30:{len(records)}")
+
+    case_ids: set[str] = set()
+    duplicate_ids: set[str] = set()
+    for index, record in enumerate(records, 1):
+        case_id = str(record.get("case_id") or f"line_{index}")
+        if case_id in case_ids:
+            duplicate_ids.add(case_id)
+        case_ids.add(case_id)
+        failures.extend(verify_record(record, index))
+
+    if duplicate_ids:
+        failures.append(f"duplicate_case_ids={sorted(duplicate_ids)}")
+
+    if failures:
+        print("verdict=failed")
+        print(f"case_count={len(records)}")
+        for failure in failures:
+            print(failure)
+        return 1
+
+    print("verdict=ok")
+    print(f"case_count={len(records)}")
+    print("judgment_allowed=false")
+    print("paper_buy_allowed=false")
+    print("trade_allowed=false")
+    print("promote_allowed=false")
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main(sys.argv))
+
diff --git a/tests/test_agent_eval_cases.py b/tests/test_agent_eval_cases.py
new file mode 100644
index 0000000..dc5c931
--- /dev/null
+++ b/tests/test_agent_eval_cases.py
@@ -0,0 +1,31 @@
+"""The bundled eval cases should stay parseable and non-authorizing."""
+
+import subprocess
+import sys
+from pathlib import Path
+
+
+ROOT = Path(__file__).resolve().parent.parent
+
+
+def test_agent_eval_cases_verify():
+    result = subprocess.run(
+        [
+            sys.executable,
+            str(ROOT / "examples" / "verify_agent_eval_cases.py"),
+            str(ROOT / "examples" / "agent_eval_cases.jsonl"),
+        ],
+        cwd=str(ROOT),
+        text=True,
+        capture_output=True,
+        timeout=20,
+    )
+
+    assert result.returncode == 0, result.stdout + result.stderr
+    assert "verdict=ok" in result.stdout
+    assert "case_count=30" in result.stdout
+    assert "judgment_allowed=false" in result.stdout
+    assert "paper_buy_allowed=false" in result.stdout
+    assert "trade_allowed=false" in result.stdout
+    assert "promote_allowed=false" in result.stdout
+

From b6b70a3f7db5e7b2968bb191d347738f7e96cfc6 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=E7=8F=8A=E7=91=9A=E6=B5=B7?= <你的邮箱>
Date: Wed, 29 Apr 2026 20:26:59 +0800
Subject: [PATCH 2/2] ci: verify agent eval cases

---
 .github/workflows/ci.yml | 1 +
 1 file changed, 1 insertion(+)

diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
index 7a79322..3ad7552 100644
--- a/.github/workflows/ci.yml
+++ b/.github/workflows/ci.yml
@@ -46,6 +46,7 @@ jobs:
           python examples/04_yijing_strategy.py
           python examples/regression_rollback_demo.py --data-dir .repro-demo
           python examples/verify_regression_rollback_event_trail.py .repro-demo/regression_rollback_event_trail.jsonl
+          python examples/verify_agent_eval_cases.py examples/agent_eval_cases.jsonl
           python examples/wrap_existing_agent.py
           python examples/langgraph_style_node.py
           python examples/05_langgraph_regression_guard.py