yangfei222666-9 · Apr 29, 2026
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -46,6 +46,7 @@ jobs:
           python examples/04_yijing_strategy.py
           python examples/regression_rollback_demo.py --data-dir .repro-demo
           python examples/verify_regression_rollback_event_trail.py .repro-demo/regression_rollback_event_trail.jsonl
+          python examples/verify_agent_eval_cases.py examples/agent_eval_cases.jsonl
           python examples/wrap_existing_agent.py
           python examples/langgraph_style_node.py
           python examples/05_langgraph_regression_guard.py

diff --git a/README.md b/README.md
@@ -14,7 +14,7 @@
 [![Python](https://img.shields.io/badge/python-3.9%2B-blue.svg)](https://www.python.org/)
 [![LLM overhead](https://img.shields.io/badge/LLM%20overhead-%3C1%25-brightgreen)](#performance)
 
-Latest verified release: [v0.1.1](https://github.com/yangfei222666-9/self-improving-loop/releases/tag/v0.1.1) · Agent data strategy: [AGENT_DATA_STRATEGY.md](AGENT_DATA_STRATEGY.md) · External repro: [EXTERNAL_REPRO.md](EXTERNAL_REPRO.md) · Launch copy: [English + 中文](LAUNCH_COPY_BILINGUAL.md) · Hermes-style guard: [docs/HERMES_SKILL_GUARD.md](docs/HERMES_SKILL_GUARD.md)
+Latest verified release: [v0.1.1](https://github.com/yangfei222666-9/self-improving-loop/releases/tag/v0.1.1) · Agent data strategy: [AGENT_DATA_STRATEGY.md](AGENT_DATA_STRATEGY.md) · Eval labels: [docs/ANNOTATION_GUIDELINE.md](docs/ANNOTATION_GUIDELINE.md) · External repro: [EXTERNAL_REPRO.md](EXTERNAL_REPRO.md) · Hermes-style guard: [docs/HERMES_SKILL_GUARD.md](docs/HERMES_SKILL_GUARD.md)
 
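 Chinese positioning: `self-improving-loop` is a regression-protection layer for AI agents. It wraps LangGraph / Hermes / custom agent nodes, records traces, detects success-rate or latency regressions, rolls back bad configs, and keeps reviewable event evidence.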
 
@@ -155,6 +155,16 @@ python3 examples/regression_rollback_demo.py --data-dir .repro-demo
 python3 examples/verify_regression_rollback_event_trail.py .repro-demo/regression_rollback_event_trail.jsonl
 ```
 
+For the bundled agent-failure eval packet, run:
+
+```bash
+python3 examples/verify_agent_eval_cases.py examples/agent_eval_cases.jsonl
+```
+
+The packet contains 30 non-authorizing cases for silent failure, stale artifacts,
+provider drift, missing event trails, rollback gaps, and unsafe action
+escalation. It is eval data only and never authorizes judgment, paper-buy, trade, or promote actions.
+
 ---
 
 ## Use it as a safety layer for your current agent

diff --git a/docs/AI_CODING_TOOL_FAILURE_NOTES.md b/docs/AI_CODING_TOOL_FAILURE_NOTES.md
@@ -0,0 +1,97 @@
+# AI Coding Tool Failure Notes
+
+This note maps common Claude Code, Cursor, OpenClaw, Hermes, and custom coding
+agent failures into trace fields, failure labels, and guard actions.
+
+The package should be positioned as a reliability layer around these tools, not
+as a replacement for them.
+
+## Core Pattern
+
+```text
+agent action -> trace -> failure label -> verifier -> rollback or block
+```
+
+The important question is not whether the tool produced code. The important
+question is whether the task is complete, tested, and recoverable.
+
+## Common Failure Modes
+
+| Failure mode | Label | Guard action |
+| --- | --- | --- |
+| Patch generated but no matching test ran. | `patch_without_test` | Block promote; require verifier. |
+| Tool call says ok but file/state is unchanged. | `tool_call_noop` | Re-read target state; retry or block. |
+| Multi-turn repair drifts away from the original issue. | `context_drift` | Re-anchor to original task and stop current branch. |
+| Provider changes from intended model to fallback without disclosure. | `provider_route_drift` | Mark blocked until route is verified. |
+| Agent output is truncated or is not valid JSON. | `partial_output_truncation` | Retry with bounded prompt or mark degraded. |
+| HTTP 200 response contains empty content. | `http_200_empty_output` | Treat as failure, not ok. |
+| Generated artifact is older than the current run. | `stale_artifact` | Block success claim; require fresh run id/hash. |
+| Review packet points to a different source artifact. | `artifact_hash_mismatch` | Block review; regenerate packet. |
+| Agent repeats the same request hash. | `duplicate_request` | Stop replay and record duplicate evidence. |
+| Regression detected but no rollback event exists. | `rollback_missing` | Block; require restore evidence. |
+| Learning output tries to authorize trade/promote/deploy. | `unsafe_action_escalation` | Hard block. |
+
+## Mapping to self-improving-loop
+
+`self-improving-loop` already provides the runtime seam:
+
+- Trace each callable execution.
+- Track success rate and latency.
+- Trigger a strategy hook when failure patterns cross a threshold.
+- Apply a guarded config patch.
+- Roll back when the patch regresses quality.
+- Preserve an event trail for audit.
+
+The eval packet adds the missing data-strategy layer:
+
+- Convert traces into failure labels.
+- Separate hard signals from soft reviewer observations.
+- Decide whether the case is `learning_only`, `blocked`, or `manual_review_required`.
+- Prevent eval data from becoming execution authority.
+
+## Tool-Specific Notes
+
+### Claude Code / Cursor
+
+Typical risk: patch quality appears high, but the actual task may be unverified.
+
+Minimum guard:
+
+- Record changed files.
+- Record tests/verifiers run.
+- Record stdout/stderr.
+- Recalculate artifact hashes after patch.
+- Block if no test or verifier maps to the change.
+
+### OpenClaw
+
+Typical risk: a tool route is available in configuration but not actually
+healthy at runtime.
+
+Minimum guard:
+
+- Probe gateway health before trusting tool availability.
+- Record actual route, version, and failure reason.
+- Treat configured-but-unreachable tools as blocked, not degraded.
+
+### Hermes / Skill Runtimes
+
+Typical risk: skill config changes can degrade output while still returning a
+formally successful result.
+
+Minimum guard:
+
+- Store previous skill config.
+- Run a baseline task.
+- Apply the candidate patch.
+- Compare quality and latency.
+- Roll back if worse.
+- Verify event trail contains restore evidence.
+
+## Non-Goals
+
+This note does not claim model training, RL, or autonomous self-improvement.
+Before those claims are valid, the system needs sustained runs, accumulated
+cases, real promote/demote/archive behavior, and at least one learned rule that
+changes future behavior under audit.
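
The "Hermes / Skill Runtimes" minimum guard in docs/AI_CODING_TOOL_FAILURE_NOTES.md (store the previous config, run a baseline, apply the candidate, compare, roll back, verify restore evidence) can be sketched roughly as below. This is a minimal illustration and not the package's API: `guarded_skill_patch`, `GuardResult`, `has_restore_evidence`, the `quality` / `latency_s` metric keys, and the 1.2 latency ratio are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Metrics = Dict[str, float]   # assumed shape: {"quality": ..., "latency_s": ...}
Config = Dict[str, object]


@dataclass
class GuardResult:
    kept_candidate: bool
    active_config: Config
    events: List[dict] = field(default_factory=list)


def guarded_skill_patch(
    run_baseline_task: Callable[[Config], Metrics],
    previous_config: Config,
    candidate_config: Config,
    max_latency_ratio: float = 1.2,  # hypothetical threshold
) -> GuardResult:
    """Try a candidate skill config; restore the previous one if it is worse."""
    stored_previous = dict(previous_config)  # store the previous skill config
    events: List[dict] = []

    # Run the same baseline task under both configs.
    baseline = run_baseline_task(stored_previous)
    events.append({"event": "baseline", "metrics": baseline})
    candidate = run_baseline_task(candidate_config)
    events.append({"event": "candidate", "metrics": candidate})

    # Compare quality and latency; either regression triggers a rollback.
    worse_quality = candidate["quality"] < baseline["quality"]
    worse_latency = candidate["latency_s"] > baseline["latency_s"] * max_latency_ratio
    if worse_quality or worse_latency:
        events.append({"event": "rollback", "restored_config": stored_previous})
        return GuardResult(False, stored_previous, events)

    events.append({"event": "candidate_kept"})
    return GuardResult(True, dict(candidate_config), events)


def has_restore_evidence(events: List[dict]) -> bool:
    """Check that the event trail contains a rollback event with the restored config."""
    return any(
        e.get("event") == "rollback" and "restored_config" in e for e in events
    )
```

The events list stands in for the event trail; a real integration would presumably write these records to the JSONL trail the package already maintains.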
+
diff --git a/docs/ANNOTATION_GUIDELINE.md b/docs/ANNOTATION_GUIDELINE.md
@@ -0,0 +1,121 @@
+# Agent Eval Annotation Guideline
+
+This guideline defines how to label agent workflow failures for regression
+guards, eval packets, and human review queues.
+
+It is not a benchmark claim. It is a conservative labeling policy for turning
+agent execution evidence into auditable data.
+
+## Goal
+
+Convert an agent run into a structured eval case:
+
+```text
+task -> trace -> artifact -> failure label -> routing verdict -> evidence
+```
+
+The labeler must not infer success from a final text answer alone. A run is only
+trusted when the required trace, artifact, and verifier evidence exist.
+
+## Required Fields
+
+Each eval case should capture:
+
+- `case_id`: stable unique identifier.
+- `domain`: workflow family, such as `coding_agent`, `tool_calling`, or `provider_route`.
+- `task_type`: what the agent was trying to do.
+- `agent_stack`: framework or runner shape, such as LangGraph, Hermes, OpenClaw, Cursor, Claude Code, or custom.
+- `prompt_summary`: short non-sensitive task summary.
+- `observed_failure`: what actually went wrong.
+- `failure_labels`: one or more controlled labels.
+- `signals.hard`: machine-checkable signals.
+- `signals.soft`: reviewer observations that are useful but not sufficient alone.
+- `trace`: provider, model, latency, success signal, artifact status, and event-flow status.
+- `expected_routing`: conservative routing decision.
+- `evidence_required`: artifacts needed before the case can be trusted.
+- `regression_guard_action`: guard action such as block, rollback, retry, or human review.
+
+## Label Set
+
+Use the narrowest labels that are supported by evidence.
+
+| Label | Meaning | Minimum evidence |
+| --- | --- | --- |
+| `silent_failure` | The workflow looks finished, but a required step did not actually complete. | Missing event, missing artifact, or verifier contradiction. |
+| `false_success` | The run reports success while evidence proves the target was not completed. | Success status plus failed verifier or stale/missing artifact. |
+| `stale_artifact` | Output was reused from an older run. | Timestamp, run id, source hash, or archive mismatch. |
+| `missing_event_trail` | The result lacks a parseable event trail. | Missing JSONL/trace or unreadable trace. |
+| `provider_route_drift` | The actual model/provider differs from the intended route. | Expected route and actual route disagree. |
+| `latency_regression` | Latency worsened beyond threshold. | Before/after latency metrics. |
+| `success_rate_regression` | Success rate dropped beyond threshold. | Before/after success metrics. |
+| `tool_call_noop` | A tool call returned ok but did not change the target state. | Tool result plus unchanged artifact/state. |
+| `patch_without_test` | Code was changed without a verifier or test proving behavior. | Diff exists, no matching test/verifier evidence. |
+| `context_drift` | Multi-turn repair no longer addresses the original task. | Original task and later action diverge. |
+| `http_200_empty_output` | Provider returned HTTP 200 but the usable output was empty or unparsable. | HTTP status plus parse failure or empty content. |
+| `rollback_missing` | Regression was detected but no rollback evidence exists. | Regression event without rollback event. |
+| `config_patch_regression` | A config/prompt/tool patch made quality worse. | Before/after quality or rollback trigger. |
+| `duplicate_request` | Same request hash/session was submitted twice and should be blocked. | Duplicate hash count or replay evidence. |
+| `partial_output_truncation` | Output is cut off and cannot support a complete verdict. | Truncation marker, invalid JSON, or missing required section. |
+| `artifact_hash_mismatch` | Review packet does not match the source artifact it claims to review. | Stored hash differs from recalculated hash. |
+| `unsafe_action_escalation` | Learning/review output tries to authorize a risky action. | `trade_allowed`, `paper_buy_allowed`, `promote_allowed`, or equivalent set true. |
+| `human_review_missing` | Human review was required but absent. | Review-required gate plus missing review artifact. |
+
+## Routing Policy
+
+Eval cases are not execution permission.
+
+Allowed routing values:
+
+- `learning_only`: safe to store as training/eval evidence only.
+- `blocked`: cannot be used until missing evidence or hard failure is fixed.
+- `manual_review_required`: human review can inspect it, but execution remains blocked.
+
+Forbidden in eval packets:
+
+- `judgment_allowed=true`
+- `paper_buy_allowed=true`
+- `trade_allowed=true`
+- `promote_allowed=true`
+
+If any forbidden flag appears in an eval packet, the packet verifier must return
+`blocked`.
+
+## Hard vs Soft Signals
+
+Hard signals are machine-checkable and can block automatically:
+
+- Missing artifact.
+- Hash mismatch.
+- Duplicate request hash.
+- Provider/model mismatch.
+- HTTP 200 with empty content or invalid JSON.
+- Event trail missing or unparsable.
+- Test/verifier failure.
+
+Soft signals require human review and cannot decide a case on their own:
+
+- The answer feels off-topic.
+- The patch looks risky.
+- The wording is ambiguous.
+- The model gave a plausible but unverified explanation.
+
+## Training vs Eval vs Review
+
+Classify cases as follows:
+
+- Training candidate: stable label, safe content, no secrets, no execution authority.
+- Eval case: deterministic expected route and required evidence are known.
+- Human review: ambiguous, high-impact, or soft-signal-heavy cases.
+- Blocked: missing evidence, unsafe action escalation, or corrupted event flow.
+
+## Stop Rules
+
+Stop and mark `blocked` when:
+
+- The event trail is missing or unparsable.
+- The source artifact is missing.
+- A learning packet contains any execution authorization.
+- The provider route cannot be verified.
+- The case contains secrets or personal data.
+- The case relies on a stale output as fresh evidence.
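
One conservative reading of the routing policy, forbidden flags, and hard/soft signal split in docs/ANNOTATION_GUIDELINE.md is sketched below. It is an illustration only: `route_case` is a hypothetical helper, and it assumes the forbidden flags appear as top-level keys and that `signals.hard` / `signals.soft` are lists, neither of which the guideline specifies.

```python
from typing import Any, Dict

# Flags that must never be true inside an eval packet.
FORBIDDEN_FLAGS = (
    "judgment_allowed",
    "paper_buy_allowed",
    "trade_allowed",
    "promote_allowed",
)

# The only routing values the guideline allows.
ALLOWED_ROUTES = ("learning_only", "blocked", "manual_review_required")


def route_case(case: Dict[str, Any]) -> str:
    """Return a routing verdict; none of the verdicts grants execution permission."""
    # Any execution authorization in eval data is blocked outright.
    if any(case.get(flag) is True for flag in FORBIDDEN_FLAGS):
        return "blocked"

    signals = case.get("signals") or {}
    # Hard signals are machine-checkable, so they block automatically.
    if signals.get("hard"):
        return "blocked"
    # Soft signals alone only queue the case for human review.
    if signals.get("soft"):
        return "manual_review_required"
    return "learning_only"
```

Checking the forbidden flags first means an authorization attempt cannot be masked by an otherwise clean set of signals.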
+
diff --git a/examples/README.md b/examples/README.md
@@ -9,6 +9,7 @@ python examples/03_langgraph_adapter.py
 python examples/04_yijing_strategy.py
 python examples/05_langgraph_regression_guard.py
 python examples/06_hermes_skill_regression_guard.py
+python examples/verify_agent_eval_cases.py examples/agent_eval_cases.jsonl
 ```
 
 ## 01_basic_tracking.py
@@ -62,3 +63,23 @@ Proves the Hermes-style skill seam without a Hermes dependency:
 
 Use this when someone asks how the package fits under Hermes, OpenClaw, or any
 skill-based agent runtime instead of competing with it.
+
+## agent_eval_cases.jsonl
+
+Provides 30 non-authorizing eval cases for coding-agent, tool-calling,
+provider-route, stale-artifact, rollback, and governance failures.
+
+Verify the packet:
+
+```bash
+python examples/verify_agent_eval_cases.py examples/agent_eval_cases.jsonl
+```
+
+Expected boundary:
+
+```text
+judgment_allowed=false
+paper_buy_allowed=false
+trade_allowed=false
+promote_allowed=false
+```
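
As a rough idea of what a per-line check over `agent_eval_cases.jsonl` could look like, the sketch below tests one JSONL line against the required fields listed in docs/ANNOTATION_GUIDELINE.md. It is not the code in `examples/verify_agent_eval_cases.py`, which this diff does not show; `check_case_line` is a hypothetical helper, and the nesting of `signals.hard` / `signals.soft` under a `signals` object is an assumption.

```python
import json
from typing import List

# Top-level field names taken from the "Required Fields" list in the guideline.
REQUIRED_FIELDS = (
    "case_id", "domain", "task_type", "agent_stack", "prompt_summary",
    "observed_failure", "failure_labels", "signals", "trace",
    "expected_routing", "evidence_required", "regression_guard_action",
)
ALLOWED_ROUTING = ("learning_only", "blocked", "manual_review_required")


def check_case_line(line: str) -> List[str]:
    """Return the problems found in one JSONL line; an empty list means none."""
    try:
        case = json.loads(line)
    except json.JSONDecodeError:
        return ["line is not valid JSON"]
    if not isinstance(case, dict):
        return ["case is not a JSON object"]

    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in case]

    # Assumed layout: hard and soft signals nested under "signals".
    signals = case.get("signals")
    if isinstance(signals, dict):
        problems += [
            f"missing field: signals.{key}" for key in ("hard", "soft") if key not in signals
        ]

    if "expected_routing" in case and case["expected_routing"] not in ALLOWED_ROUTING:
        problems.append("expected_routing is not an allowed routing value")
    return problems
```

Returning a list of problems rather than a boolean keeps the evidence for why a case was rejected, which fits the audit-first framing of the packet.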