Merged
1 change: 1 addition & 0 deletions .github/workflows/ci.yml
Original file line number Diff line number Diff line change
@@ -46,6 +46,7 @@ jobs:
python examples/04_yijing_strategy.py
python examples/regression_rollback_demo.py --data-dir .repro-demo
python examples/verify_regression_rollback_event_trail.py .repro-demo/regression_rollback_event_trail.jsonl
python examples/verify_agent_eval_cases.py examples/agent_eval_cases.jsonl
python examples/wrap_existing_agent.py
python examples/langgraph_style_node.py
python examples/05_langgraph_regression_guard.py
12 changes: 11 additions & 1 deletion README.md
@@ -14,7 +14,7 @@
[![Python](https://img.shields.io/badge/python-3.9%2B-blue.svg)](https://www.python.org/)
[![LLM overhead](https://img.shields.io/badge/LLM%20overhead-%3C1%25-brightgreen)](#performance)

Latest verified release: [v0.1.1](https://github.com/yangfei222666-9/self-improving-loop/releases/tag/v0.1.1) · Agent data strategy: [AGENT_DATA_STRATEGY.md](AGENT_DATA_STRATEGY.md) · External repro: [EXTERNAL_REPRO.md](EXTERNAL_REPRO.md) · Launch copy: [English + 中文](LAUNCH_COPY_BILINGUAL.md) · Hermes-style guard: [docs/HERMES_SKILL_GUARD.md](docs/HERMES_SKILL_GUARD.md)
Latest verified release: [v0.1.1](https://github.com/yangfei222666-9/self-improving-loop/releases/tag/v0.1.1) · Agent data strategy: [AGENT_DATA_STRATEGY.md](AGENT_DATA_STRATEGY.md) · Eval labels: [docs/ANNOTATION_GUIDELINE.md](docs/ANNOTATION_GUIDELINE.md) · External repro: [EXTERNAL_REPRO.md](EXTERNAL_REPRO.md) · Hermes-style guard: [docs/HERMES_SKILL_GUARD.md](docs/HERMES_SKILL_GUARD.md)

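Positioning: `self-improving-loop` is a regression-protection layer for AI agents. It wraps LangGraph / Hermes / custom agent nodes, records traces, detects success-rate or latency degradation, rolls back bad configs, and keeps reviewable event evidence.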

@@ -155,6 +155,16 @@ python3 examples/regression_rollback_demo.py --data-dir .repro-demo
python3 examples/verify_regression_rollback_event_trail.py .repro-demo/regression_rollback_event_trail.jsonl
```

For the bundled agent-failure eval packet, run:

```bash
python3 examples/verify_agent_eval_cases.py examples/agent_eval_cases.jsonl
```

The packet contains 30 non-authorizing cases for silent failure, stale artifacts,
provider drift, missing event trails, rollback gaps, and unsafe action
escalation. It is eval data only and grants no judgment, paper-buy, trade, or
promote permission.

---

## Use it as a safety layer for your current agent
97 changes: 97 additions & 0 deletions docs/AI_CODING_TOOL_FAILURE_NOTES.md
@@ -0,0 +1,97 @@
# AI Coding Tool Failure Notes

This note maps common failures of Claude Code, Cursor, OpenClaw, Hermes, and
custom coding agents to trace fields, failure labels, and guard actions.

The package should be positioned as a reliability layer around these tools, not
as a replacement for them.

## Core Pattern

```text
agent action -> trace -> failure label -> verifier -> rollback or block
```

The important question is not whether the tool produced code. The important
question is whether the task is complete, tested, and recoverable.

## Common Failure Modes

| Failure mode | Label | Guard action |
| --- | --- | --- |
| Patch generated but no matching test ran. | `patch_without_test` | Block promote; require verifier. |
| Tool call says ok but file/state is unchanged. | `tool_call_noop` | Re-read target state; retry or block. |
| Multi-turn repair drifts away from the original issue. | `context_drift` | Re-anchor to original task and stop current branch. |
| Provider changes from intended model to fallback without disclosure. | `provider_route_drift` | Mark blocked until route is verified. |
| Agent output is truncated or invalid JSON. | `partial_output_truncation` | Retry with bounded prompt or mark degraded. |
| HTTP 200 response contains empty content. | `http_200_empty_output` | Treat as failure, not ok. |
| Generated artifact is older than the current run. | `stale_artifact` | Block success claim; require fresh run id/hash. |
| Review packet points to a different source artifact. | `artifact_hash_mismatch` | Block review; regenerate packet. |
| Agent repeats the same request hash. | `duplicate_request` | Stop replay and record duplicate evidence. |
| Regression detected but no rollback event exists. | `rollback_missing` | Block; require restore evidence. |
| Learning output tries to authorize trade/promote/deploy. | `unsafe_action_escalation` | Hard block. |
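
The table above can be held as a plain lookup so that a guard never has to guess an action. This is a minimal sketch: `GUARD_ACTIONS`, `DEFAULT_ACTION`, and `guard_action_for` are hypothetical names, not part of the `self-improving-loop` API, and blocking unknown labels is an assumption made in the spirit of the conservative policy.

```python
# Hypothetical lookup built from the table above; not part of the package API.
GUARD_ACTIONS = {
    "patch_without_test": "Block promote; require verifier.",
    "tool_call_noop": "Re-read target state; retry or block.",
    "context_drift": "Re-anchor to original task and stop current branch.",
    "provider_route_drift": "Mark blocked until route is verified.",
    "partial_output_truncation": "Retry with bounded prompt or mark degraded.",
    "http_200_empty_output": "Treat as failure, not ok.",
    "stale_artifact": "Block success claim; require fresh run id/hash.",
    "artifact_hash_mismatch": "Block review; regenerate packet.",
    "duplicate_request": "Stop replay and record duplicate evidence.",
    "rollback_missing": "Block; require restore evidence.",
    "unsafe_action_escalation": "Hard block.",
}

# Assumption: an unrecognized label is blocked rather than ignored.
DEFAULT_ACTION = "Block; unknown label."


def guard_action_for(label: str) -> str:
    """Return the guard action for a failure label, blocking unknown labels."""
    return GUARD_ACTIONS.get(label, DEFAULT_ACTION)
```

Keeping the action text identical to the table means the lookup and the documentation cannot silently diverge in meaning.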

## Mapping to self-improving-loop

`self-improving-loop` already provides the runtime seam:

- Trace each callable execution.
- Track success rate and latency.
- Trigger a strategy hook when failure patterns cross a threshold.
- Apply a guarded config patch.
- Roll back when the patch regresses quality.
- Preserve an event trail for audit.

The eval packet layer adds the missing data strategy:

- Convert traces into failure labels.
- Separate hard signals from soft reviewer observations.
- Decide whether the case is `learning_only`, `blocked`, or `manual_review_required`.
- Prevent eval data from becoming execution authority.
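
Converting a trace into failure labels could look roughly like the sketch below. The function `label_trace` and the trace keys (`http_status`, `content`, `expected_model`, `actual_model`, `tool_ok`, `state_changed`) are assumptions chosen for illustration; the package's real trace schema is not shown in this note.

```python
from typing import Any, Dict, List


def label_trace(trace: Dict[str, Any]) -> List[str]:
    """Derive failure labels from hard, machine-checkable trace fields."""
    labels: List[str] = []
    # HTTP 200 with no usable content counts as a failure, not a success.
    if trace.get("http_status") == 200 and not trace.get("content"):
        labels.append("http_200_empty_output")
    # The route drifted if the model that answered is not the intended one.
    if trace.get("expected_model") != trace.get("actual_model"):
        labels.append("provider_route_drift")
    # A tool call that reports ok but leaves the target unchanged is a no-op.
    if trace.get("tool_ok") and trace.get("state_changed") is False:
        labels.append("tool_call_noop")
    return labels
```

Only hard signals are checked here, so the result can feed an automatic block; soft reviewer observations would be recorded separately.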

## Tool-Specific Notes

### Claude Code / Cursor

Typical risk: patch quality appears high, but the actual task may be unverified.

Minimum guard:

- Record changed files.
- Record tests/verifiers run.
- Record stdout/stderr.
- Recalculate artifact hashes after patch.
- Block if no test or verifier maps to the change.
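
The last two items of this guard can be sketched with the standard library alone. The helper names (`sha256_of`, `unverified_files`, `should_block`) and the shape of `verified_by` (changed file path mapped to the tests or verifiers that cover it) are assumptions for illustration, not an existing API.

```python
import hashlib
from typing import Dict, Iterable, List


def sha256_of(data: bytes) -> str:
    """Hash artifact bytes so the hash can be recalculated after a patch."""
    return hashlib.sha256(data).hexdigest()


def unverified_files(changed_files: Iterable[str],
                     verified_by: Dict[str, List[str]]) -> List[str]:
    """Return changed files that no recorded test or verifier maps to."""
    return [path for path in changed_files if not verified_by.get(path)]


def should_block(changed_files: Iterable[str],
                 verified_by: Dict[str, List[str]]) -> bool:
    """Block when any changed file lacks test or verifier evidence."""
    return bool(unverified_files(changed_files, verified_by))
```

A file mapped to an empty list is treated the same as a file with no entry, so an empty record cannot pass as evidence.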

### OpenClaw

Typical risk: a tool route is available in configuration but not actually
healthy at runtime.

Minimum guard:

- Probe gateway health before trusting tool availability.
- Record actual route, version, and failure reason.
- Treat configured-but-unreachable tools as blocked, not degraded.
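
A sketch of this guard follows, with the health probe injected as a callable so that no real gateway endpoint is assumed. `tool_status` and its return shape are hypothetical; OpenClaw's actual health API is not described in this note.

```python
from typing import Callable, Dict


def tool_status(configured: bool, probe: Callable[[], bool]) -> Dict[str, str]:
    """Trust a tool only when it is configured and its probe succeeds."""
    if not configured:
        return {"status": "blocked", "reason": "tool not configured"}
    try:
        healthy = probe()
    except Exception as exc:  # any probe failure means the route is unreachable
        return {"status": "blocked", "reason": f"probe raised {type(exc).__name__}"}
    if not healthy:
        # Configured but unhealthy is blocked, not degraded.
        return {"status": "blocked", "reason": "probe reported unhealthy"}
    return {"status": "available", "reason": ""}
```

Recording the reason alongside the status keeps the failure cause available for the event trail.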

### Hermes / Skill Runtimes

Typical risk: skill config changes can degrade output while still returning a
formally successful result.

Minimum guard:

- Store previous skill config.
- Run a baseline task.
- Apply the candidate patch.
- Compare quality and latency.
- Roll back if worse.
- Verify event trail contains restore evidence.
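
The six steps above can be sketched as one function. Everything here is illustrative: `guarded_patch`, `has_restore_evidence`, the event names, and the rule that any drop in quality or rise in latency counts as "worse" (with no tolerance) are assumptions, not the package's implementation.

```python
from typing import Any, Callable, Dict, List, Tuple

Config = Dict[str, Any]
Metrics = Tuple[float, float]  # (quality, latency_seconds)


def guarded_patch(config: Config, patch: Config,
                  run_task: Callable[[Config], Metrics],
                  events: List[Dict[str, Any]]) -> Config:
    """Apply a candidate patch and restore the previous config if it regresses."""
    previous = dict(config)                          # store previous skill config
    base_quality, base_latency = run_task(previous)  # run a baseline task
    candidate = {**previous, **patch}                # apply the candidate patch
    events.append({"event": "patch_applied", "patch": patch})
    quality, latency = run_task(candidate)           # compare quality and latency
    if quality < base_quality or latency > base_latency:
        events.append({"event": "rollback", "restored": previous})
        return previous                              # roll back if worse
    return candidate


def has_restore_evidence(events: List[Dict[str, Any]]) -> bool:
    """Check that the event trail records a rollback with the restored config."""
    return any(e.get("event") == "rollback" and "restored" in e for e in events)
```

The caller passes in the event list so the trail survives the call and can be verified independently of the returned config.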

## Non-Goals

This note does not claim model training, RL, or autonomous self-improvement.
Before those claims are valid, the system needs sustained runs, accumulated
cases, real promote/demote/archive behavior, and at least one learned rule that
changes future behavior under audit.

121 changes: 121 additions & 0 deletions docs/ANNOTATION_GUIDELINE.md
@@ -0,0 +1,121 @@
# Agent Eval Annotation Guideline

This guideline defines how to label agent workflow failures for regression
guards, eval packets, and human review queues.

It is not a benchmark claim. It is a conservative labeling policy for turning
agent execution evidence into auditable data.

## Goal

Convert an agent run into a structured eval case:

```text
task -> trace -> artifact -> failure label -> routing verdict -> evidence
```

The labeler must not infer success from a final text answer alone. A run is only
trusted when the required trace, artifact, and verifier evidence exist.

## Required Fields

Each eval case should capture:

- `case_id`: stable unique identifier.
- `domain`: workflow family, such as `coding_agent`, `tool_calling`, or `provider_route`.
- `task_type`: what the agent was trying to do.
- `agent_stack`: framework or runner shape, such as LangGraph, Hermes, OpenClaw, Cursor, Claude Code, or custom.
- `prompt_summary`: short non-sensitive task summary.
- `observed_failure`: what actually went wrong.
- `failure_labels`: one or more controlled labels.
- `signals.hard`: machine-checkable signals.
- `signals.soft`: reviewer observations that are useful but not sufficient alone.
- `trace`: provider, model, latency, success signal, artifact status, and event-flow status.
- `expected_routing`: conservative routing decision.
- `evidence_required`: artifacts needed before the case can be trusted.
- `regression_guard_action`: guard action such as block, rollback, retry, or human review.
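
As an illustration of the field list above, the sketch below checks a case for missing fields. The nested layout of `signals` and `trace`, the helper `missing_fields`, and every value in `EXAMPLE_CASE` are hypothetical; they are not taken from the bundled `agent_eval_cases.jsonl`.

```python
from typing import Any, Dict, List

REQUIRED_FIELDS = (
    "case_id", "domain", "task_type", "agent_stack", "prompt_summary",
    "observed_failure", "failure_labels", "signals", "trace",
    "expected_routing", "evidence_required", "regression_guard_action",
)


def missing_fields(case: Dict[str, Any]) -> List[str]:
    """List required fields absent from a case, including signals.hard/soft."""
    missing = [name for name in REQUIRED_FIELDS if name not in case]
    signals = case.get("signals")
    if isinstance(signals, dict):
        missing += [f"signals.{key}" for key in ("hard", "soft") if key not in signals]
    return missing


# Hypothetical case for illustration only; not a row from the bundled packet.
EXAMPLE_CASE = {
    "case_id": "example-001",
    "domain": "coding_agent",
    "task_type": "bug_fix",
    "agent_stack": "custom",
    "prompt_summary": "Fix a failing unit test.",
    "observed_failure": "Patch was produced but no test was run.",
    "failure_labels": ["patch_without_test"],
    "signals": {"hard": ["no_matching_test_run"], "soft": []},
    "trace": {"provider": "example-provider", "model": "example-model",
              "latency_ms": 1200, "success": True,
              "artifact_status": "present", "event_flow_status": "complete"},
    "expected_routing": "blocked",
    "evidence_required": ["test_run_log"],
    "regression_guard_action": "block",
}
```

An empty result from `missing_fields` only shows the case is structurally complete; it says nothing about whether the labels are correct.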

## Label Set

Use the narrowest labels that are supported by evidence.

| Label | Meaning | Minimum evidence |
| --- | --- | --- |
| `silent_failure` | The workflow looks finished, but a required step did not actually complete. | Missing event, missing artifact, or verifier contradiction. |
| `false_success` | The run reports success while evidence proves the target was not completed. | Success status plus failed verifier or stale/missing artifact. |
| `stale_artifact` | Output was reused from an older run. | Timestamp, run id, source hash, or archive mismatch. |
| `missing_event_trail` | The result lacks a parseable event trail. | Missing JSONL/trace or unreadable trace. |
| `provider_route_drift` | The actual model/provider differs from the intended route. | Expected route and actual route disagree. |
| `latency_regression` | Latency worsened beyond threshold. | Before/after latency metrics. |
| `success_rate_regression` | Success rate dropped beyond threshold. | Before/after success metrics. |
| `tool_call_noop` | A tool call returned ok but did not change the target state. | Tool result plus unchanged artifact/state. |
| `patch_without_test` | Code was changed without a verifier or test proving behavior. | Diff exists, no matching test/verifier evidence. |
| `context_drift` | Multi-turn repair no longer addresses the original task. | Original task and later action diverge. |
| `http_200_empty_output` | Provider returned HTTP 200 but the usable output was empty or unparsable. | HTTP status plus parse failure or empty content. |
| `rollback_missing` | Regression was detected but no rollback evidence exists. | Regression event without rollback event. |
| `config_patch_regression` | A config/prompt/tool patch made quality worse. | Before/after quality or rollback trigger. |
| `duplicate_request` | Same request hash/session was submitted twice and should be blocked. | Duplicate hash count or replay evidence. |
| `partial_output_truncation` | Output is cut off and cannot support a complete verdict. | Truncation marker, invalid JSON, or missing required section. |
| `artifact_hash_mismatch` | Review packet does not match the source artifact it claims to review. | Stored hash differs from recalculated hash. |
| `unsafe_action_escalation` | Learning/review output tries to authorize a risky action. | `trade_allowed`, `paper_buy_allowed`, `promote_allowed`, or equivalent set true. |
| `human_review_missing` | Human review was required but absent. | Review-required gate plus missing review artifact. |

## Routing Policy

Eval cases are not execution permission.

Allowed routing values:

- `learning_only`: safe to store as training/eval evidence only.
- `blocked`: cannot be used until missing evidence or hard failure is fixed.
- `manual_review_required`: human review can inspect it, but execution remains blocked.

Forbidden in eval packets:

- `judgment_allowed=true`
- `paper_buy_allowed=true`
- `trade_allowed=true`
- `promote_allowed=true`

If any forbidden flag appears in an eval packet, the packet verifier must return
`blocked`.
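
The rule above could be enforced along the lines of the sketch below. It is independent of the bundled `examples/verify_agent_eval_cases.py`, whose implementation is not shown here; `verify_packet`, `has_forbidden_flag`, the recursive search through nested objects, and the `"pass"` return value are all assumptions.

```python
import json
from typing import Any, Iterable

FORBIDDEN_FLAGS = ("judgment_allowed", "paper_buy_allowed",
                   "trade_allowed", "promote_allowed")
ALLOWED_ROUTING = {"learning_only", "blocked", "manual_review_required"}


def has_forbidden_flag(obj: Any) -> bool:
    """Recursively look for any forbidden flag that is set to true."""
    if isinstance(obj, dict):
        return any((key in FORBIDDEN_FLAGS and value is True)
                   or has_forbidden_flag(value)
                   for key, value in obj.items())
    if isinstance(obj, list):
        return any(has_forbidden_flag(item) for item in obj)
    return False


def verify_packet(lines: Iterable[str]) -> str:
    """Return "blocked" on the first bad JSONL line, otherwise "pass"."""
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines between cases
        try:
            case = json.loads(line)
        except json.JSONDecodeError:
            return "blocked"  # an unparsable case cannot be trusted
        if not isinstance(case, dict) or has_forbidden_flag(case):
            return "blocked"
        if case.get("expected_routing") not in ALLOWED_ROUTING:
            return "blocked"
    return "pass"
```

Searching nested objects means a flag hidden inside `trace` or another sub-object still blocks the packet.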

## Hard vs Soft Signals

Hard signals are machine-checkable and can block automatically:

- Missing artifact.
- Hash mismatch.
- Duplicate request hash.
- Provider/model mismatch.
- HTTP 200 with empty content or invalid JSON.
- Event trail missing or unparsable.
- Test/verifier failure.

Soft signals require review and cannot justify promotion on their own:

- The answer feels off-topic.
- The patch looks risky.
- The wording is ambiguous.
- The model gave a plausible but unverified explanation.
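
One way to read the split above is as a three-way routing decision. The function `route_case` is a hypothetical sketch, and routing a case with no signals to `learning_only` is an assumption rather than something this guideline states.

```python
from typing import Sequence


def route_case(hard_signals: Sequence[str], soft_signals: Sequence[str]) -> str:
    """Map signals to one of the allowed routing values."""
    if hard_signals:
        return "blocked"                  # hard signals block automatically
    if soft_signals:
        return "manual_review_required"   # soft signals need a human reviewer
    return "learning_only"                # assumption: no signals, store as evidence
```

Hard signals are checked first, so a case with both kinds is blocked rather than queued for review.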

## Training vs Eval vs Review

Classify each case as follows:

- Training candidate: stable label, safe content, no secrets, no execution authority.
- Eval case: deterministic expected route and required evidence are known.
- Human review: ambiguous, high-impact, or soft-signal-heavy cases.
- Blocked: missing evidence, unsafe action escalation, or corrupted event flow.

## Stop Rules

Stop and mark `blocked` when:

- The event trail is missing or unparsable.
- The source artifact is missing.
- A learning packet contains any execution authorization.
- The provider route cannot be verified.
- The case contains secrets or personal data.
- The case relies on a stale output as fresh evidence.

21 changes: 21 additions & 0 deletions examples/README.md
@@ -9,6 +9,7 @@ python examples/03_langgraph_adapter.py
python examples/04_yijing_strategy.py
python examples/05_langgraph_regression_guard.py
python examples/06_hermes_skill_regression_guard.py
python examples/verify_agent_eval_cases.py examples/agent_eval_cases.jsonl
```

## 01_basic_tracking.py
@@ -62,3 +63,23 @@ Proves the Hermes-style skill seam without a Hermes dependency:

Use this when someone asks how the package fits under Hermes, OpenClaw, or any
skill-based agent runtime instead of competing with it.

## agent_eval_cases.jsonl

Provides 30 non-authorizing eval cases for coding-agent, tool-calling,
provider-route, stale-artifact, rollback, and governance failures.

Verify the packet:

```bash
python examples/verify_agent_eval_cases.py examples/agent_eval_cases.jsonl
```

Expected boundary:

```text
judgment_allowed=false
paper_buy_allowed=false
trade_allowed=false
promote_allowed=false
```