diff --git a/README.md b/README.md
index f28c882..acb2042 100644
--- a/README.md
+++ b/README.md
@@ -488,6 +488,7 @@ The imp@k (improvement-at-k) metrics system tracks performance deltas over time.
 | Metric | Formula | What It Tells You |
 |---|---|---|
 | **imp@week** | `avg_score(this_week) - avg_score(last_week)` | Weekly performance trajectory |
+| **imp@baseline** | `avg_score(current_week) - avg_score(anchor_week)` | Cumulative drift from a fixed reference point |
 | **imp@skill** | `avg_score(after_skill) - avg_score(before_skill)` | Impact of a specific skill addition |
 | **imp@config** | `avg_score(after_change) - avg_score(before_change)` | Impact of a config change |
 | **imp@model** | `avg_score(new_model) - avg_score(old_model)` | Impact of switching models |
@@ -502,6 +503,13 @@ imp@week < -0.2 → Regressing ↓
 
 Three consecutive weeks of regression triggers an investigation. Not a panic, an investigation. Maybe the tasks got harder. Maybe the model provider shipped an update. Maybe a skill is interfering with other skills. The point is: you notice.
 
+**imp@week's blind spot.** A slow monotonic degradation reads as "stable" every week. If an agent drops 0.05/week, each consecutive delta is within the stable band — but after 8 weeks you have lost 0.4 points. `imp@baseline` catches this: anchor to a fixed reference week and the cumulative drift is always visible.
+
+```bash
+# Run both reports together
+./metrics/scripts/compute-impk.sh --baseline 2026-W12
+```
+
 #### Example: imp@skill Measurement
 
 ```json
diff --git a/evals/METHODOLOGY.md b/evals/METHODOLOGY.md
index 9112dc3..bd849a8 100644
--- a/evals/METHODOLOGY.md
+++ b/evals/METHODOLOGY.md
@@ -107,6 +107,52 @@ Trend categories:
 
 Computed by: `scripts/compute-impk.sh`
 
+### imp@baseline
+
+imp@baseline anchors performance to a fixed reference week instead of the immediately prior week.
+This catches slow, monotonic regressions that imp@week misses.
+
+**The blind spot in imp@week alone:**
+If an agent degrades from 4.5 → 4.3 → 4.1 → 3.9 over four weeks, each imp@week is −0.2
+(visible). But if the drop is 0.05/week the consecutive deltas all read "stable" while the
+cumulative drift is −0.2+ — invisible to imp@week, visible to imp@baseline.
+
+```
+imp@baseline = avg_score(current_week) - avg_score(baseline_week)
+```
+
+Computed by: `scripts/compute-impk.sh --baseline <BASELINE_WEEK>`
+
+**When to use it:**
+- After 4+ weeks of data, set your first eval week as the baseline.
+- After a significant config change, set that week as a new anchor to measure the
+  cumulative effect of subsequent tuning.
+- Run both reports side by side: imp@week shows weekly volatility, imp@baseline shows
+  whether the agent is better or worse than when you started.
+
+**Example:**
+
+```
+$ ./compute-impk.sh 2026-W16 --baseline 2026-W12
+
+imp@week report
+Current week : 2026-W16
+Previous week: 2026-W15
+...
+
+imp@baseline report
+Baseline week: 2026-W12
+Current week : 2026-W16
+
+Agent        Current  Baseline imp@baseline Trend
+------------ -------- -------- ------------ ----------
+Ada          3.70     4.50     -0.80        regressing
+Rita         4.10     4.00     +0.10        stable
+```
+
+Ada's imp@week may have appeared "stable" for weeks, but imp@baseline reveals −0.80
+cumulative drift since the anchor — a real regression worth investigating.
+
 ### imp@skill
 
 imp@skill measures the performance change after a skill is added or modified.
diff --git a/metrics/README.md b/metrics/README.md
index 57b9872..1950f16 100644
--- a/metrics/README.md
+++ b/metrics/README.md
@@ -11,6 +11,7 @@ Our adaptation for production agent systems:
 | Metric | What it measures | Formula |
 |--------|-----------------|---------|
 | **imp@week** | Weekly performance delta per agent | avg_score(week N) - avg_score(week N-1) |
+| **imp@baseline** | Cumulative drift from a fixed anchor week | avg_score(current) - avg_score(baseline) |
 | **imp@skill** | Performance change after adding a skill | avg_score(after) - avg_score(before) |
 
 ### Trend Categories
@@ -39,7 +40,7 @@ Gathers automated metrics from agent session logs and memory files.
 
 ### `compute-impk.sh`
 
-Calculates imp@week by comparing consecutive weekly result files.
+Calculates imp@week (and optionally imp@baseline) by comparing weekly result files.
 
 ```bash
 # Auto-detect latest two weeks
@@ -50,8 +51,20 @@ Calculates imp@week by comparing consecutive weekly result files.
 
 # Compare two specific weeks
 ./scripts/compute-impk.sh 2026-W14 2026-W13
+
+# Add baseline anchoring (catches slow multi-week regressions)
+./scripts/compute-impk.sh --baseline 2026-W12
+
+# Current week vs prior + both vs fixed baseline
+./scripts/compute-impk.sh 2026-W16 --baseline 2026-W12
 ```
 
+**Why use `--baseline`?**
+`imp@week` compares each week to the previous one. A slow monotonic degradation
+(e.g. −0.05/week) reads as "stable" every week while the cumulative loss grows.
+`--baseline` anchors to a fixed reference week so total drift is always visible.
+Use your first evaluation week or a post-stabilization week as the anchor.
+
 ### `track-skill-impact.sh`
 
 Measures the before/after impact of adding or modifying a skill.
diff --git a/metrics/scripts/compute-impk.sh b/metrics/scripts/compute-impk.sh
index 2a34e73..245fdbc 100755
--- a/metrics/scripts/compute-impk.sh
+++ b/metrics/scripts/compute-impk.sh
@@ -1,14 +1,23 @@
 #!/usr/bin/env bash
-# compute-impk.sh -- Calculate imp@week per agent by comparing consecutive weekly result files.
+# compute-impk.sh -- Calculate imp@week and imp@baseline per agent.
 #
-# imp@week:  performance delta (average score change vs previous week)
-# imp@skill: performance delta after a skill was added (see track-skill-impact.sh)
-# Trend:     improving (>+0.1), stable (-0.1 to +0.1), regressing (<-0.1)
+# imp@week:     performance delta vs the immediately previous week
+# imp@baseline: performance delta vs a fixed anchor week (monotonic drift detector)
+# imp@skill:    performance delta after a skill was added (see track-skill-impact.sh)
+# Trend:        improving (>+0.1), stable (-0.1 to +0.1), regressing (<-0.1)
 #
 # Usage:
-#   ./compute-impk.sh                    # compare latest two weeks
-#   ./compute-impk.sh 2026-W14           # compare W14 vs W13
-#   ./compute-impk.sh 2026-W14 2026-W13  # compare specific pair
+#   ./compute-impk.sh                               # compare latest two weeks
+#   ./compute-impk.sh 2026-W14                      # compare W14 vs W13
+#   ./compute-impk.sh 2026-W14 2026-W13             # compare specific pair
+#   ./compute-impk.sh --baseline 2026-W12           # W-latest vs W12 anchor
+#   ./compute-impk.sh 2026-W14 --baseline 2026-W12  # W14 vs W13 + W14 vs W12 anchor
+#
+# Why --baseline matters:
+#   If an agent degrades monotonically across several weeks, imp@week stays near
+#   zero (each degraded week looks similar to the previous degraded week), masking
+#   a real long-term regression. Anchoring to a fixed baseline week makes that
+#   drift visible as imp@baseline = current_avg - baseline_avg.
 
 set -euo pipefail
 
@@ -72,7 +81,6 @@ get_agent_avg() {
 # Trend label based on delta
 trend_label() {
   local delta="$1"
-  # Use awk for float comparison
   echo "$delta" | awk '{
     if ($1 > 0.1) print "improving"
     else if ($1 < -0.1) print "regressing"
@@ -110,8 +118,34 @@ resolve_week_file() {
 
 CURRENT_FILE=""
 PREVIOUS_FILE=""
+BASELINE_FILE=""
+BASELINE_WEEK=""
+
+# Parse args: support positional week labels and --baseline flag
+positional=()
+while [[ $# -gt 0 ]]; do
+  case "$1" in
+    --baseline)
+      shift
+      BASELINE_WEEK="${1:-}"
+      if [ -z "$BASELINE_WEEK" ]; then
+        echo "Error: --baseline requires a week label (e.g. 2026-W12)" >&2
+        exit 1
+      fi
+      BASELINE_FILE="$(resolve_week_file "$BASELINE_WEEK")"
+      if [ -z "$BASELINE_FILE" ]; then
+        echo "Baseline file not found: $RESULTS_DIR/$BASELINE_WEEK.md" >&2
+        exit 1
+      fi
+      ;;
+    *)
+      positional+=("$1")
+      ;;
+  esac
+  shift
+done
 
-if [ $# -eq 0 ]; then
+if [ "${#positional[@]}" -eq 0 ]; then
   mapfile -t candidates < <(find_latest_two)
   if [ "${#candidates[@]}" -lt 1 ]; then
     echo "No result files found in $RESULTS_DIR" >&2
@@ -121,10 +155,10 @@ if [ $# -eq 0 ]; then
   if [ "${#candidates[@]}" -ge 2 ]; then
     PREVIOUS_FILE="${candidates[-2]}"
   fi
-elif [ $# -eq 1 ]; then
-  CURRENT_FILE="$(resolve_week_file "$1")"
+elif [ "${#positional[@]}" -eq 1 ]; then
+  CURRENT_FILE="$(resolve_week_file "${positional[0]}")"
   if [ -z "$CURRENT_FILE" ]; then
-    echo "File not found: $RESULTS_DIR/$1.md" >&2
+    echo "File not found: $RESULTS_DIR/${positional[0]}.md" >&2
     exit 1
   fi
   # Find the one before it
@@ -134,22 +168,25 @@ elif [ $# -eq 1 ]; then
       PREVIOUS_FILE="${all_files[$((i-1))]}"
     fi
   done
-elif [ $# -eq 2 ]; then
-  CURRENT_FILE="$(resolve_week_file "$1")"
-  PREVIOUS_FILE="$(resolve_week_file "$2")"
+elif [ "${#positional[@]}" -eq 2 ]; then
+  CURRENT_FILE="$(resolve_week_file "${positional[0]}")"
+  PREVIOUS_FILE="$(resolve_week_file "${positional[1]}")"
   if [ -z "$CURRENT_FILE" ]; then
-    echo "File not found: $RESULTS_DIR/$1.md" >&2; exit 1
+    echo "File not found: $RESULTS_DIR/${positional[0]}.md" >&2; exit 1
   fi
   if [ -z "$PREVIOUS_FILE" ]; then
-    echo "File not found: $RESULTS_DIR/$2.md" >&2; exit 1
+    echo "File not found: $RESULTS_DIR/${positional[1]}.md" >&2; exit 1
   fi
+else
+  echo "Usage: compute-impk.sh [CURRENT_WEEK [PREVIOUS_WEEK]] [--baseline BASELINE_WEEK]" >&2
+  exit 1
 fi
 
 CURRENT_WEEK="$(basename "$CURRENT_FILE" .md)"
 PREVIOUS_WEEK=""
 [ -n "$PREVIOUS_FILE" ] && PREVIOUS_WEEK="$(basename "$PREVIOUS_FILE" .md)"
 
-# --- Compute and Print ---
+# --- imp@week Report ---
 
 echo "imp@week report"
 echo "Current week : $CURRENT_WEEK"
@@ -162,8 +199,10 @@ echo ""
 printf "%-12s %-8s %-8s %-10s %-10s\n" "Agent" "Current" "Previous" "imp@week" "Trend"
 printf "%-12s %-8s %-8s %-10s %-10s\n" "------------" "--------" "--------" "----------" "----------"
 
+declare -A current_avgs
 for agent in "${AGENTS[@]}"; do
   current_avg="$(get_agent_avg "$CURRENT_FILE" "$agent")"
+  current_avgs["$agent"]="$current_avg"
   previous_avg=""
   [ -n "$PREVIOUS_FILE" ] && previous_avg="$(get_agent_avg "$PREVIOUS_FILE" "$agent")"
 
@@ -186,6 +225,49 @@ done
 
 echo ""
 
+# --- imp@baseline Report (only when --baseline provided) ---
+#
+# Why this matters:
+#   imp@week can miss slow multi-week regressions. If an agent drops from 4.5 to
+#   4.2 to 3.9 to 3.6 over four weeks, each imp@week is -0.3 (visible), but if
+#   the drop is 0.05/week it reads as "stable" every week while the cumulative
+#   loss is 0.2+. Anchoring to a fixed baseline week makes the total drift
+#   visible regardless of the step size.
+
+if [ -n "$BASELINE_FILE" ]; then
+  echo "imp@baseline report"
+  echo "Baseline week: $BASELINE_WEEK"
+  echo "Current week : $CURRENT_WEEK"
+  echo ""
+  printf "%-12s %-8s %-8s %-12s %-10s\n" "Agent" "Current" "Baseline" "imp@baseline" "Trend"
+  printf "%-12s %-8s %-8s %-12s %-10s\n" "------------" "--------" "--------" "------------" "----------"
+
+  for agent in "${AGENTS[@]}"; do
+    current_avg="${current_avgs[$agent]:-}"
+    # Re-fetch current if not yet set (shouldn't happen, but defensive)
+    [ -z "$current_avg" ] && current_avg="$(get_agent_avg "$CURRENT_FILE" "$agent")"
+    baseline_avg="$(get_agent_avg "$BASELINE_FILE" "$agent")"
+
+    if [ -z "$current_avg" ]; then
+      printf "%-12s %-8s %-8s %-12s %-10s\n" "$agent" "(none)" "${baseline_avg:-(none)}" "N/A" "no data"
+      continue
+    fi
+
+    if [ -z "$baseline_avg" ]; then
+      printf "%-12s %-8s %-8s %-12s %-10s\n" "$agent" "$current_avg" "(none)" "N/A" "no baseline"
+      continue
+    fi
+
+    delta="$(echo "$current_avg $baseline_avg" | awk '{printf "%.2f", $1-$2}')"
+    signed_delta="$(echo "$delta" | awk '{if ($1>0) printf "+%.2f", $1; else printf "%.2f", $1}')"
+    trend="$(trend_label "$delta")"
+
+    printf "%-12s %-8s %-8s %-12s %-10s\n" "$agent" "$current_avg" "$baseline_avg" "$signed_delta" "$trend"
+  done
+
+  echo ""
+fi
+
 # --- imp@skill (from skill-impacts.json if present) ---
 SKILL_IMPACTS="$RESULTS_DIR/skill-impacts.json"
 if [ -f "$SKILL_IMPACTS" ] && command -v jq &>/dev/null; then