choutos · nanookclaw · Mar 29, 2026
diff --git a/README.md b/README.md
@@ -488,6 +488,7 @@ The imp@k (improvement-at-k) metrics system tracks performance deltas over time.
 | Metric | Formula | What It Tells You |
 |---|---|---|
 | **imp@week** | `avg_score(this_week) - avg_score(last_week)` | Weekly performance trajectory |
+| **imp@baseline** | `avg_score(current_week) - avg_score(anchor_week)` | Cumulative drift from a fixed reference point |
 | **imp@skill** | `avg_score(after_skill) - avg_score(before_skill)` | Impact of a specific skill addition |
 | **imp@config** | `avg_score(after_change) - avg_score(before_change)` | Impact of a config change |
 | **imp@model** | `avg_score(new_model) - avg_score(old_model)` | Impact of switching models |
@@ -502,6 +503,13 @@ imp@week < -0.2  → Regressing  ↓
 
 Three consecutive weeks of regression triggers an investigation. Not a panic, an investigation. Maybe the tasks got harder. Maybe the model provider shipped an update. Maybe a skill is interfering with other skills. The point is: you notice.
 
+**imp@week's blind spot.** A slow monotonic degradation reads as "stable" every week. If an agent drops 0.05/week, each consecutive delta is within the stable band — but after 8 weeks you have lost 0.4 points. `imp@baseline` catches this: anchor to a fixed reference week and the cumulative drift is always visible.
+
+```bash
+# Run both reports together
+./metrics/scripts/compute-impk.sh --baseline 2026-W12
+```
+
 #### Example: imp@skill Measurement
 
 ```json

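Note on the blind spot described in the README hunk above: the arithmetic can be checked with a small simulation. This is a minimal sketch, not part of the PR. The scores are made up for illustration, `delta` is a hypothetical helper, and the thresholds are assumed to match `trend_label` in `compute-impk.sh` (stable band −0.1 to +0.1).

```bash
#!/usr/bin/env bash
# Illustrative only: made-up scores declining 0.05/week from a 4.50 anchor.
set -euo pipefail

# Assumed to mirror trend_label in compute-impk.sh (stable band: -0.1 to +0.1).
trend_label() {
    awk -v d="$1" 'BEGIN {
        if (d > 0.1) print "improving"
        else if (d < -0.1) print "regressing"
        else print "stable"
    }'
}

# Hypothetical helper: difference of two scores, rounded to two decimals.
delta() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f", a - b }'
}

scores=(4.50 4.45 4.40 4.35 4.30 4.25 4.20 4.15 4.10)  # index 0 is the anchor week

for ((i = 1; i < ${#scores[@]}; i++)); do
    wk="$(delta "${scores[i]}" "${scores[i-1]}")"
    base="$(delta "${scores[i]}" "${scores[0]}")"
    printf 'week %d  imp@week %s (%s)  imp@baseline %s (%s)\n' \
        "$i" "$wk" "$(trend_label "$wk")" "$base" "$(trend_label "$base")"
done
```

Every imp@week value is −0.05 and labeled stable, while imp@baseline crosses into regressing at week 3 (−0.15) and ends at −0.40 after week 8.
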
diff --git a/evals/METHODOLOGY.md b/evals/METHODOLOGY.md
@@ -107,6 +107,52 @@ Trend categories:
 
 Computed by: `scripts/compute-impk.sh`
 
+### imp@baseline
+
+imp@baseline anchors performance to a fixed reference week instead of the immediately prior week.
+This catches slow, monotonic regressions that imp@week misses.
+
+**The blind spot in imp@week alone:**
+If an agent degrades from 4.5 → 4.3 → 4.1 → 3.9 over four weeks, each imp@week is −0.2
+(visible). But if the drop is 0.05/week, the consecutive deltas all read "stable" while the
+cumulative drift reaches −0.2 after four weeks: invisible to imp@week, visible to imp@baseline.
+
+```
+imp@baseline = avg_score(current_week) - avg_score(baseline_week)
+```
+
+Computed by: `scripts/compute-impk.sh --baseline <WEEK>`
+
+**When to use it:**
+- After 4+ weeks of data, set your first eval week as the baseline.
+- After a significant config change, set that week as a new anchor to measure the
+  cumulative effect of subsequent tuning.
+- Run both reports side by side: imp@week shows weekly volatility, imp@baseline shows
+  whether the agent is better or worse than when you started.
+
+**Example:**
+
+```
+$ ./compute-impk.sh 2026-W16 --baseline 2026-W12
+
+imp@week report
+Current week : 2026-W16
+Previous week: 2026-W15
+...
+
+imp@baseline report
+Baseline week: 2026-W12
+Current week : 2026-W16
+
+Agent        Current   Baseline  imp@baseline  Trend
+------------ --------  --------  ------------  ----------
+Ada          3.70      4.50      -0.80         regressing
+Rita         4.10      4.00      +0.10         stable
+```
+
+Taken one at a time, Ada's imp@week deltas can look modest, but imp@baseline reveals −0.80
+cumulative drift over the four weeks since the anchor: a real regression worth investigating.
+
 ### imp@skill
 
 imp@skill measures the performance change after a skill is added or modified.

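Note on "set your first eval week as the baseline" in the METHODOLOGY hunk above: one way to find that week without looking it up by hand is sketched below. `first_week` is a hypothetical helper, not part of the PR. It assumes result files are named `<YYYY>-W<NN>.md` (as the script's error messages suggest), so that lexicographic order matches chronological order.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: print the earliest week label found in a results directory.
# Assumes files are named <YYYY>-W<NN>.md with a zero-padded week number.
first_week() {
    local dir="$1" f
    local weeks=()
    for f in "$dir"/[0-9][0-9][0-9][0-9]-W[0-9][0-9].md; do
        [ -e "$f" ] || continue   # the glob matched nothing
        weeks+=("$(basename "$f" .md)")
    done
    [ "${#weeks[@]}" -gt 0 ] || return 1
    printf '%s\n' "${weeks[@]}" | LC_ALL=C sort | sed -n '1p'
}

# Demo against a throwaway directory; non-week files are ignored.
demo_dir="$(mktemp -d)"
touch "$demo_dir/2026-W14.md" "$demo_dir/2026-W12.md" \
      "$demo_dir/2026-W13.md" "$demo_dir/skill-impacts.json"
first_week "$demo_dir"   # prints 2026-W12
rm -rf "$demo_dir"
```

With such a helper, the anchor could be passed as `--baseline "$(first_week path/to/results)"`, where `path/to/results` is a placeholder for wherever the weekly files live.
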
diff --git a/metrics/README.md b/metrics/README.md
@@ -11,6 +11,7 @@ Our adaptation for production agent systems:
 | Metric | What it measures | Formula |
 |--------|-----------------|---------|
 | **imp@week** | Weekly performance delta per agent | avg_score(week N) - avg_score(week N-1) |
+| **imp@baseline** | Cumulative drift from a fixed anchor week | avg_score(current) - avg_score(baseline) |
 | **imp@skill** | Performance change after adding a skill | avg_score(after) - avg_score(before) |
 
 ### Trend Categories
@@ -39,7 +40,7 @@ Gathers automated metrics from agent session logs and memory files.
 
 ### `compute-impk.sh`
 
-Calculates imp@week by comparing consecutive weekly result files.
+Calculates imp@week (and optionally imp@baseline) by comparing weekly result files.
 
 ```bash
 # Auto-detect latest two weeks
@@ -50,8 +51,20 @@ Calculates imp@week by comparing consecutive weekly result files.
 
 # Compare two specific weeks
 ./scripts/compute-impk.sh 2026-W14 2026-W13
+
+# Add baseline anchoring (catches slow multi-week regressions)
+./scripts/compute-impk.sh --baseline 2026-W12
+
+# W16 vs prior week, plus W16 vs fixed W12 baseline
+./scripts/compute-impk.sh 2026-W16 --baseline 2026-W12
 ```
 
+**Why use `--baseline`?**
+`imp@week` compares each week to the previous one. A slow monotonic degradation
+(e.g. −0.05/week) reads as "stable" every week while the cumulative loss grows.
+`--baseline` anchors to a fixed reference week so total drift is always visible.
+Use your first evaluation week or a post-stabilization week as the anchor.
+
 ### `track-skill-impact.sh`
 
 Measures the before/after impact of adding or modifying a skill.

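Note on the argument handling in the script diff below: because `--baseline` is consumed inside the loop and everything else is collected as positional, the flag should work in any position. The sketch below isolates that loop to show the behavior. `parse_args` is a hypothetical wrapper introduced here for illustration, and it skips the file resolution the real script performs in the same branch.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical wrapper around the PR's parsing loop (file resolution omitted).
# Sets two globals: BASELINE_WEEK and the positional array.
parse_args() {
    BASELINE_WEEK=""
    positional=()
    while [[ $# -gt 0 ]]; do
        case "$1" in
            --baseline)
                shift
                BASELINE_WEEK="${1:-}"
                if [ -z "$BASELINE_WEEK" ]; then
                    echo "Error: --baseline requires a week label (e.g. 2026-W12)" >&2
                    return 1
                fi
                ;;
            *)
                positional+=("$1")
                ;;
        esac
        shift
    done
}

# Flag after the positional week
parse_args 2026-W16 --baseline 2026-W12
echo "positional=${positional[*]} baseline=$BASELINE_WEEK"

# Flag before the positional week: same result
parse_args --baseline 2026-W12 2026-W16
echo "positional=${positional[*]} baseline=$BASELINE_WEEK"
```

Both calls print `positional=2026-W16 baseline=2026-W12`. A trailing `--baseline` with no value returns an error before the final `shift` runs, which matters under `set -e`.
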
diff --git a/metrics/scripts/compute-impk.sh b/metrics/scripts/compute-impk.sh
@@ -1,14 +1,23 @@
 #!/usr/bin/env bash
-# compute-impk.sh -- Calculate imp@week per agent by comparing consecutive weekly result files.
+# compute-impk.sh -- Calculate imp@week and imp@baseline per agent.
 #
-# imp@week: performance delta (average score change vs previous week)
-# imp@skill: performance delta after a skill was added (see track-skill-impact.sh)
-# Trend: improving (>+0.1), stable (-0.1 to +0.1), regressing (<-0.1)
+# imp@week:     performance delta vs the immediately previous week
+# imp@baseline: performance delta vs a fixed anchor week (monotonic drift detector)
+# imp@skill:    performance delta after a skill was added (see track-skill-impact.sh)
+# Trend:        improving (>+0.1), stable (-0.1 to +0.1), regressing (<-0.1)
 #
 # Usage:
-#   ./compute-impk.sh                    # compare latest two weeks
-#   ./compute-impk.sh 2026-W14           # compare W14 vs W13
-#   ./compute-impk.sh 2026-W14 2026-W13  # compare specific pair
+#   ./compute-impk.sh                              # compare latest two weeks
+#   ./compute-impk.sh 2026-W14                     # compare W14 vs W13
+#   ./compute-impk.sh 2026-W14 2026-W13            # compare specific pair
+#   ./compute-impk.sh --baseline 2026-W12          # latest two weeks + latest vs W12 anchor
+#   ./compute-impk.sh 2026-W14 --baseline 2026-W12 # W14 vs W13 + W14 vs W12 anchor
+#
+# Why --baseline matters:
+#   If an agent degrades slowly and monotonically across several weeks, each
+#   imp@week stays inside the stable band (every week looks similar to the one
+#   before it), masking a real long-term regression.  Anchoring to a fixed
+#   baseline week makes that drift visible as imp@baseline = current_avg - baseline_avg.
 
 set -euo pipefail
 
@@ -72,7 +81,6 @@ get_agent_avg() {
 # Trend label based on delta
 trend_label() {
     local delta="$1"
-    # Use awk for float comparison
     echo "$delta" | awk '{
         if ($1 > 0.1) print "improving"
         else if ($1 < -0.1) print "regressing"
@@ -110,8 +118,34 @@ resolve_week_file() {
 
 CURRENT_FILE=""
 PREVIOUS_FILE=""
+BASELINE_FILE=""
+BASELINE_WEEK=""
+
+# Parse args: support positional week labels and --baseline flag
+positional=()
+while [[ $# -gt 0 ]]; do
+    case "$1" in
+        --baseline)
+            shift
+            BASELINE_WEEK="${1:-}"
+            if [ -z "$BASELINE_WEEK" ]; then
+                echo "Error: --baseline requires a week label (e.g. 2026-W12)" >&2
+                exit 1
+            fi
+            BASELINE_FILE="$(resolve_week_file "$BASELINE_WEEK")"
+            if [ -z "$BASELINE_FILE" ]; then
+                echo "Baseline file not found: $RESULTS_DIR/$BASELINE_WEEK.md" >&2
+                exit 1
+            fi
+            ;;
+        *)
+            positional+=("$1")
+            ;;
+    esac
+    shift
+done
 
-if [ $# -eq 0 ]; then
+if [ "${#positional[@]}" -eq 0 ]; then
     mapfile -t candidates < <(find_latest_two)
     if [ "${#candidates[@]}" -lt 1 ]; then
         echo "No result files found in $RESULTS_DIR" >&2
@@ -121,10 +155,10 @@ if [ $# -eq 0 ]; then
     if [ "${#candidates[@]}" -ge 2 ]; then
         PREVIOUS_FILE="${candidates[-2]}"
     fi
-elif [ $# -eq 1 ]; then
-    CURRENT_FILE="$(resolve_week_file "$1")"
+elif [ "${#positional[@]}" -eq 1 ]; then
+    CURRENT_FILE="$(resolve_week_file "${positional[0]}")"
     if [ -z "$CURRENT_FILE" ]; then
-        echo "File not found: $RESULTS_DIR/$1.md" >&2
+        echo "File not found: $RESULTS_DIR/${positional[0]}.md" >&2
         exit 1
     fi
     # Find the one before it
@@ -134,22 +168,25 @@ elif [ $# -eq 1 ]; then
             PREVIOUS_FILE="${all_files[$((i-1))]}"
         fi
     done
-elif [ $# -eq 2 ]; then
-    CURRENT_FILE="$(resolve_week_file "$1")"
-    PREVIOUS_FILE="$(resolve_week_file "$2")"
+elif [ "${#positional[@]}" -eq 2 ]; then
+    CURRENT_FILE="$(resolve_week_file "${positional[0]}")"
+    PREVIOUS_FILE="$(resolve_week_file "${positional[1]}")"
     if [ -z "$CURRENT_FILE" ]; then
-        echo "File not found: $RESULTS_DIR/$1.md" >&2; exit 1
+        echo "File not found: $RESULTS_DIR/${positional[0]}.md" >&2; exit 1
     fi
     if [ -z "$PREVIOUS_FILE" ]; then
-        echo "File not found: $RESULTS_DIR/$2.md" >&2; exit 1
+        echo "File not found: $RESULTS_DIR/${positional[1]}.md" >&2; exit 1
     fi
+else
+    echo "Usage: compute-impk.sh [CURRENT_WEEK [PREVIOUS_WEEK]] [--baseline BASELINE_WEEK]" >&2
+    exit 1
 fi
 
 CURRENT_WEEK="$(basename "$CURRENT_FILE" .md)"
 PREVIOUS_WEEK=""
 [ -n "$PREVIOUS_FILE" ] && PREVIOUS_WEEK="$(basename "$PREVIOUS_FILE" .md)"
 
-# --- Compute and Print ---
+# --- imp@week Report ---
 
 echo "imp@week report"
 echo "Current week : $CURRENT_WEEK"
@@ -162,8 +199,10 @@ echo ""
 printf "%-12s  %-8s  %-8s  %-10s  %-10s\n" "Agent" "Current" "Previous" "imp@week" "Trend"
 printf "%-12s  %-8s  %-8s  %-10s  %-10s\n" "------------" "--------" "--------" "----------" "----------"
 
+declare -A current_avgs
 for agent in "${AGENTS[@]}"; do
     current_avg="$(get_agent_avg "$CURRENT_FILE" "$agent")"
+    current_avgs["$agent"]="$current_avg"
     previous_avg=""
     [ -n "$PREVIOUS_FILE" ] && previous_avg="$(get_agent_avg "$PREVIOUS_FILE" "$agent")"
 
@@ -186,6 +225,49 @@ done
 
 echo ""
 
+# --- imp@baseline Report (only when --baseline provided) ---
+#
+# Why this matters:
+#   imp@week can miss slow multi-week regressions. If an agent drops from 4.5 to
+#   4.2 to 3.9 to 3.6 over four weeks, each imp@week is -0.3 (visible), but if
+#   the drop is 0.05/week it reads as "stable" every week while the cumulative
+#   loss reaches 0.2 after four weeks.  Anchoring to a fixed baseline week makes
+#   the total drift visible regardless of the step size.
+
+if [ -n "$BASELINE_FILE" ]; then
+    echo "imp@baseline report"
+    echo "Baseline week: $BASELINE_WEEK"
+    echo "Current week : $CURRENT_WEEK"
+    echo ""
+    printf "%-12s  %-8s  %-8s  %-12s  %-10s\n" "Agent" "Current" "Baseline" "imp@baseline" "Trend"
+    printf "%-12s  %-8s  %-8s  %-12s  %-10s\n" "------------" "--------" "--------" "------------" "----------"
+
+    for agent in "${AGENTS[@]}"; do
+        current_avg="${current_avgs[$agent]:-}"
+        # Re-fetch current if not yet set (shouldn't happen, but defensive)
+        [ -z "$current_avg" ] && current_avg="$(get_agent_avg "$CURRENT_FILE" "$agent")"
+        baseline_avg="$(get_agent_avg "$BASELINE_FILE" "$agent")"
+
+        if [ -z "$current_avg" ]; then
+            printf "%-12s  %-8s  %-8s  %-12s  %-10s\n" "$agent" "(none)" "${baseline_avg:-(none)}" "N/A" "no data"
+            continue
+        fi
+
+        if [ -z "$baseline_avg" ]; then
+            printf "%-12s  %-8s  %-8s  %-12s  %-10s\n" "$agent" "$current_avg" "(none)" "N/A" "no baseline"
+            continue
+        fi
+
+        delta="$(echo "$current_avg $baseline_avg" | awk '{printf "%.2f", $1-$2}')"
+        signed_delta="$(echo "$delta" | awk '{if ($1>0) printf "+%.2f", $1; else printf "%.2f", $1}')"
+        trend="$(trend_label "$delta")"
+
+        printf "%-12s  %-8s  %-8s  %-12s  %-10s\n" "$agent" "$current_avg" "$baseline_avg" "$signed_delta" "$trend"
+    done
+
+    echo ""
+fi
+
 # --- imp@skill (from skill-impacts.json if present) ---
 SKILL_IMPACTS="$RESULTS_DIR/skill-impacts.json"
 if [ -f "$SKILL_IMPACTS" ] && command -v jq &>/dev/null; then