8 changes: 8 additions & 0 deletions README.md
@@ -488,6 +488,7 @@ The imp@k (improvement-at-k) metrics system tracks performance deltas over time.
| Metric | Formula | What It Tells You |
|---|---|---|
| **imp@week** | `avg_score(this_week) - avg_score(last_week)` | Weekly performance trajectory |
| **imp@baseline** | `avg_score(current_week) - avg_score(anchor_week)` | Cumulative drift from a fixed reference point |
| **imp@skill** | `avg_score(after_skill) - avg_score(before_skill)` | Impact of a specific skill addition |
| **imp@config** | `avg_score(after_change) - avg_score(before_change)` | Impact of a config change |
| **imp@model** | `avg_score(new_model) - avg_score(old_model)` | Impact of switching models |
@@ -502,6 +503,13 @@ imp@week < -0.2 → Regressing ↓

Three consecutive weeks of regression triggers an investigation. Not a panic, an investigation. Maybe the tasks got harder. Maybe the model provider shipped an update. Maybe a skill is interfering with other skills. The point is: you notice.

**imp@week's blind spot.** A slow monotonic degradation reads as "stable" every week. If an agent drops 0.05/week, each consecutive delta is within the stable band — but after 8 weeks you have lost 0.4 points. `imp@baseline` catches this: anchor to a fixed reference week and the cumulative drift is always visible.

```bash
# Run both reports together
./metrics/scripts/compute-impk.sh --baseline 2026-W12
```
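
To make the arithmetic above concrete, here is a minimal sketch that simulates a steady 0.05/week decline and prints both deltas side by side. `simulate_drift` is a hypothetical helper written for illustration, not part of `compute-impk.sh`, and the 4.50 starting score is an assumed value.

```bash
#!/usr/bin/env bash
# Hypothetical illustration: simulate an agent losing a fixed amount per week
# and compare the week-over-week delta with the delta against a fixed anchor.
# The starting score and step size passed below are assumptions.

simulate_drift() {
  local start="$1" step="$2" weeks="$3"
  awk -v start="$start" -v step="$step" -v weeks="$weeks" 'BEGIN {
    prev = start
    for (w = 1; w <= weeks; w++) {
      cur = start - step * w
      # imp@week compares against the previous week, imp@baseline against week 0
      printf "week %d  score %.2f  imp@week %+.2f  imp@baseline %+.2f\n", w, cur, cur - prev, cur - start
      prev = cur
    }
  }'
}

simulate_drift 4.50 0.05 8
```

With these inputs every row reports imp@week as -0.05, while imp@baseline reaches -0.40 by week 8.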

#### Example: imp@skill Measurement

```json
46 changes: 46 additions & 0 deletions evals/METHODOLOGY.md
@@ -107,6 +107,52 @@ Trend categories:

Computed by: `scripts/compute-impk.sh`

### imp@baseline

imp@baseline anchors performance to a fixed reference week instead of the immediately prior week.
This catches slow, monotonic regressions that imp@week misses.

**The blind spot in imp@week alone:**
If an agent degrades from 4.5 → 4.3 → 4.1 → 3.9 over four weeks, each imp@week is −0.2
(visible). But if the drop is only 0.05/week, every consecutive delta reads "stable" while the
cumulative drift reaches −0.2 after four such weeks and keeps growing: invisible to imp@week,
visible to imp@baseline.

```
imp@baseline = avg_score(current_week) - avg_score(baseline_week)
```

Computed by: `scripts/compute-impk.sh --baseline <WEEK>`
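
As a rough sketch of the formula, the snippet below computes the delta for one agent and attaches a trend label. `imp_baseline` is a hypothetical standalone helper, not a function in `compute-impk.sh`. The ±0.1 thresholds are taken from the script's header comment, and rounding to two decimals before labelling mirrors what the script appears to do.

```bash
#!/usr/bin/env bash
# Sketch only: imp_baseline is a hypothetical helper, not part of compute-impk.sh.
# Thresholds follow the script header: improving (> +0.1), regressing (< -0.1).

imp_baseline() {
  local current="$1" baseline="$2"
  awk -v c="$current" -v b="$baseline" 'BEGIN {
    # Round to two decimals first so the label matches the printed delta
    d = sprintf("%.2f", c - b) + 0
    if (d > 0.1) t = "improving"
    else if (d < -0.1) t = "regressing"
    else t = "stable"
    printf "%+.2f %s\n", d, t
  }'
}

imp_baseline 3.70 4.50
imp_baseline 4.10 4.00
```

The two calls use the same averages as the Ada and Rita rows in the example report in this section.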

**When to use it:**
- After 4+ weeks of data, set your first eval week as the baseline.
- After a significant config change, set that week as a new anchor to measure the
cumulative effect of subsequent tuning.
- Run both reports side by side: imp@week shows weekly volatility, imp@baseline shows
whether the agent is better or worse than when you started.
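
One possible way to pick the first eval week automatically is sketched below. `earliest_week` is a hypothetical helper, and it assumes result files are named like `2026-W12.md` with zero-padded week numbers in a single directory, so that a lexical sort is also chronological. The `./results` path in the usage comment is likewise an assumption.

```bash
#!/usr/bin/env bash
# Sketch only: earliest_week is a hypothetical helper, not part of the repo.
# Assumes files named <YEAR>-W<NN>.md (zero-padded week) directly inside "$1".

earliest_week() {
  local dir="$1" first
  # LC_ALL=C gives a byte-wise sort, which is chronological for this naming scheme
  first="$(find "$dir" -maxdepth 1 -type f -name '*-W*.md' | LC_ALL=C sort | head -n 1)"
  [ -n "$first" ] || return 1
  basename "$first" .md
}

# Example usage (the results path is an assumption):
#   ./scripts/compute-impk.sh --baseline "$(earliest_week ./results)"
```

The helper returns a non-zero status when no matching file exists, so a caller can fall back to a plain imp@week run.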

**Example:**

```
$ ./compute-impk.sh 2026-W16 --baseline 2026-W12

imp@week report
Current week : 2026-W16
Previous week: 2026-W15
...

imp@baseline report
Baseline week: 2026-W12
Current week : 2026-W16

Agent        Current  Baseline imp@baseline Trend
------------ -------- -------- ------------ ----------
Ada          3.70     4.50     -0.80        regressing
Rita         4.10     4.00     +0.10        stable
```

Some of Ada's weekly deltas may have read "stable", but imp@baseline reveals −0.80 of
cumulative drift since the W12 anchor (an average of −0.20 per week over four weeks),
a real regression worth investigating.

### imp@skill

imp@skill measures the performance change after a skill is added or modified.
15 changes: 14 additions & 1 deletion metrics/README.md
@@ -11,6 +11,7 @@ Our adaptation for production agent systems:
| Metric | What it measures | Formula |
|--------|-----------------|---------|
| **imp@week** | Weekly performance delta per agent | avg_score(week N) - avg_score(week N-1) |
| **imp@baseline** | Cumulative drift from a fixed anchor week | avg_score(current) - avg_score(baseline) |
| **imp@skill** | Performance change after adding a skill | avg_score(after) - avg_score(before) |

### Trend Categories
@@ -39,7 +40,7 @@ Gathers automated metrics from agent session logs and memory files.

### `compute-impk.sh`

Calculates imp@week by comparing consecutive weekly result files.
Calculates imp@week (and optionally imp@baseline) by comparing weekly result files.

```bash
# Auto-detect latest two weeks
@@ -50,8 +51,20 @@ Calculates imp@week by comparing consecutive weekly result files.

# Compare two specific weeks
./scripts/compute-impk.sh 2026-W14 2026-W13

# Add baseline anchoring (catches slow multi-week regressions)
./scripts/compute-impk.sh --baseline 2026-W12

# W16 vs the prior week, plus W16 vs the fixed baseline
./scripts/compute-impk.sh 2026-W16 --baseline 2026-W12
```

**Why use `--baseline`?**
`imp@week` compares each week to the previous one. A slow monotonic degradation
(e.g. −0.05/week) reads as "stable" every week while the cumulative loss grows.
`--baseline` anchors to a fixed reference week so total drift is always visible.
Use your first evaluation week or a post-stabilization week as the anchor.

### `track-skill-impact.sh`

Measures the before/after impact of adding or modifying a skill.
118 changes: 100 additions & 18 deletions metrics/scripts/compute-impk.sh
@@ -1,14 +1,23 @@
#!/usr/bin/env bash
# compute-impk.sh -- Calculate imp@week per agent by comparing consecutive weekly result files.
# compute-impk.sh -- Calculate imp@week and imp@baseline per agent.
#
# imp@week: performance delta (average score change vs previous week)
# imp@skill: performance delta after a skill was added (see track-skill-impact.sh)
# Trend: improving (>+0.1), stable (-0.1 to +0.1), regressing (<-0.1)
# imp@week: performance delta vs the immediately previous week
# imp@baseline: performance delta vs a fixed anchor week (monotonic drift detector)
# imp@skill: performance delta after a skill was added (see track-skill-impact.sh)
# Trend: improving (>+0.1), stable (-0.1 to +0.1), regressing (<-0.1)
#
# Usage:
# ./compute-impk.sh # compare latest two weeks
# ./compute-impk.sh 2026-W14 # compare W14 vs W13
# ./compute-impk.sh 2026-W14 2026-W13 # compare specific pair
# ./compute-impk.sh # compare latest two weeks
# ./compute-impk.sh 2026-W14 # compare W14 vs W13
# ./compute-impk.sh 2026-W14 2026-W13 # compare specific pair
# ./compute-impk.sh --baseline 2026-W12 # W-latest vs W12 anchor
# ./compute-impk.sh 2026-W14 --baseline 2026-W12 # W14 vs W13 + W14 vs W12 anchor
#
# Why --baseline matters:
# If an agent degrades slowly but monotonically across several weeks, each
# imp@week stays inside the stable band (every week looks much like the one
# before it), masking a real long-term regression. Anchoring to a fixed baseline
# week makes that drift visible as imp@baseline = current_avg - baseline_avg.

set -euo pipefail

@@ -72,7 +81,6 @@ get_agent_avg() {
# Trend label based on delta
trend_label() {
local delta="$1"
# Use awk for float comparison
echo "$delta" | awk '{
if ($1 > 0.1) print "improving"
else if ($1 < -0.1) print "regressing"
@@ -110,8 +118,34 @@ resolve_week_file() {

CURRENT_FILE=""
PREVIOUS_FILE=""
BASELINE_FILE=""
BASELINE_WEEK=""

# Parse args: support positional week labels and --baseline flag
positional=()
while [[ $# -gt 0 ]]; do
case "$1" in
--baseline)
shift
BASELINE_WEEK="${1:-}"
if [ -z "$BASELINE_WEEK" ]; then
echo "Error: --baseline requires a week label (e.g. 2026-W12)" >&2
exit 1
fi
BASELINE_FILE="$(resolve_week_file "$BASELINE_WEEK")"
if [ -z "$BASELINE_FILE" ]; then
echo "Baseline file not found: $RESULTS_DIR/$BASELINE_WEEK.md" >&2
exit 1
fi
;;
*)
positional+=("$1")
;;
esac
shift
done

if [ $# -eq 0 ]; then
if [ "${#positional[@]}" -eq 0 ]; then
mapfile -t candidates < <(find_latest_two)
if [ "${#candidates[@]}" -lt 1 ]; then
echo "No result files found in $RESULTS_DIR" >&2
@@ -121,10 +155,10 @@ if [ $# -eq 0 ]; then
if [ "${#candidates[@]}" -ge 2 ]; then
PREVIOUS_FILE="${candidates[-2]}"
fi
elif [ $# -eq 1 ]; then
CURRENT_FILE="$(resolve_week_file "$1")"
elif [ "${#positional[@]}" -eq 1 ]; then
CURRENT_FILE="$(resolve_week_file "${positional[0]}")"
if [ -z "$CURRENT_FILE" ]; then
echo "File not found: $RESULTS_DIR/$1.md" >&2
echo "File not found: $RESULTS_DIR/${positional[0]}.md" >&2
exit 1
fi
# Find the one before it
@@ -134,22 +168,25 @@ elif [ $# -eq 1 ]; then
PREVIOUS_FILE="${all_files[$((i-1))]}"
fi
done
elif [ $# -eq 2 ]; then
CURRENT_FILE="$(resolve_week_file "$1")"
PREVIOUS_FILE="$(resolve_week_file "$2")"
elif [ "${#positional[@]}" -eq 2 ]; then
CURRENT_FILE="$(resolve_week_file "${positional[0]}")"
PREVIOUS_FILE="$(resolve_week_file "${positional[1]}")"
if [ -z "$CURRENT_FILE" ]; then
echo "File not found: $RESULTS_DIR/$1.md" >&2; exit 1
echo "File not found: $RESULTS_DIR/${positional[0]}.md" >&2; exit 1
fi
if [ -z "$PREVIOUS_FILE" ]; then
echo "File not found: $RESULTS_DIR/$2.md" >&2; exit 1
echo "File not found: $RESULTS_DIR/${positional[1]}.md" >&2; exit 1
fi
else
echo "Usage: compute-impk.sh [CURRENT_WEEK [PREVIOUS_WEEK]] [--baseline BASELINE_WEEK]" >&2
exit 1
fi

CURRENT_WEEK="$(basename "$CURRENT_FILE" .md)"
PREVIOUS_WEEK=""
[ -n "$PREVIOUS_FILE" ] && PREVIOUS_WEEK="$(basename "$PREVIOUS_FILE" .md)"

# --- Compute and Print ---
# --- imp@week Report ---

echo "imp@week report"
echo "Current week : $CURRENT_WEEK"
@@ -162,8 +199,10 @@ echo ""
printf "%-12s %-8s %-8s %-10s %-10s\n" "Agent" "Current" "Previous" "imp@week" "Trend"
printf "%-12s %-8s %-8s %-10s %-10s\n" "------------" "--------" "--------" "----------" "----------"

declare -A current_avgs
for agent in "${AGENTS[@]}"; do
current_avg="$(get_agent_avg "$CURRENT_FILE" "$agent")"
current_avgs["$agent"]="$current_avg"
previous_avg=""
[ -n "$PREVIOUS_FILE" ] && previous_avg="$(get_agent_avg "$PREVIOUS_FILE" "$agent")"

@@ -186,6 +225,49 @@ done

echo ""

# --- imp@baseline Report (only when --baseline provided) ---
#
# Why this matters:
# imp@week can miss slow multi-week regressions. If an agent drops from 4.5 to
# 4.2 to 3.9 to 3.6 over four weeks, each imp@week is -0.3 (visible), but if
# the drop is 0.05/week it reads as "stable" every week while the cumulative
# loss reaches 0.2 after four weeks and keeps growing. Anchoring to a fixed
# baseline week makes the total drift visible regardless of the step size.

if [ -n "$BASELINE_FILE" ]; then
echo "imp@baseline report"
echo "Baseline week: $BASELINE_WEEK"
echo "Current week : $CURRENT_WEEK"
echo ""
printf "%-12s %-8s %-8s %-12s %-10s\n" "Agent" "Current" "Baseline" "imp@baseline" "Trend"
printf "%-12s %-8s %-8s %-12s %-10s\n" "------------" "--------" "--------" "------------" "----------"

for agent in "${AGENTS[@]}"; do
current_avg="${current_avgs[$agent]:-}"
# Re-fetch current if not yet set (shouldn't happen, but defensive)
[ -z "$current_avg" ] && current_avg="$(get_agent_avg "$CURRENT_FILE" "$agent")"
baseline_avg="$(get_agent_avg "$BASELINE_FILE" "$agent")"

if [ -z "$current_avg" ]; then
printf "%-12s %-8s %-8s %-12s %-10s\n" "$agent" "(none)" "${baseline_avg:-(none)}" "N/A" "no data"
continue
fi

if [ -z "$baseline_avg" ]; then
printf "%-12s %-8s %-8s %-12s %-10s\n" "$agent" "$current_avg" "(none)" "N/A" "no baseline"
continue
fi

delta="$(echo "$current_avg $baseline_avg" | awk '{printf "%.2f", $1-$2}')"
signed_delta="$(echo "$delta" | awk '{if ($1>0) printf "+%.2f", $1; else printf "%.2f", $1}')"
trend="$(trend_label "$delta")"

printf "%-12s %-8s %-8s %-12s %-10s\n" "$agent" "$current_avg" "$baseline_avg" "$signed_delta" "$trend"
done

echo ""
fi

# --- imp@skill (from skill-impacts.json if present) ---
SKILL_IMPACTS="$RESULTS_DIR/skill-impacts.json"
if [ -f "$SKILL_IMPACTS" ] && command -v jq &>/dev/null; then