From 47ad05437499542f97cd54283248fad2fb9a6069 Mon Sep 17 00:00:00 2001
From: Eva <arncalso@gmail.com>
Date: Fri, 19 Jun 2026 23:08:18 +0700
Subject: [PATCH] =?UTF-8?q?docs:=20annotate=20the=20tightened=20scoring=20?=
 =?UTF-8?q?ruler=20(sc=5Fd4b9=E2=80=A6)=20=E2=80=94=20scores=20NOT=20compa?=
 =?UTF-8?q?rable=20across=20rulers?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The 2026-06 cycle materially tightened the scoring ruler (feature-engagement coverage scorer #1018,
acts felt-shape #1001/#1002, betrayal un-inversion #999, romance gate #997, dm_advanced_time unmask
#1024, gate-severity accuracy #1030), so a run scores LOWER under sc_d4b93982763a/lc_d7fcfddd5bf7
than under the v1.0.4 rulers — BY DESIGN (the scorer is a tightening feedback loop). Document the
ruler-version mechanism + history in SCORING.md §0 and annotate it in the v1.0.5-rc1 CHANGELOG, so
current numbers are never mis-compared to historic ones (every scores_db row is fenced by
scoring_config_version/lens_config_version). Stable-checkpoint hygiene per owner.
---
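Reviewer note (not part of the commit): the fencing rule this patch documents can be sketched as below. This is a minimal illustration, not the project's code. It assumes a `scores_db` row can be read as a mapping carrying the `scoring_config_version` and `lens_config_version` fields named above. The helper names, the `run` and `story` keys, and the older-ruler hash strings are hypothetical placeholders.

```python
from collections import defaultdict
from typing import Iterable, Mapping

# A ruler is identified by the pair (sc_ hash, lc_ hash).
Ruler = tuple[str, str]


def ruler_of(row: Mapping[str, object]) -> Ruler:
    """Return the ruler pair stamped on a row (assumed field names)."""
    return (str(row["scoring_config_version"]), str(row["lens_config_version"]))


def comparable(a: Mapping[str, object], b: Mapping[str, object]) -> bool:
    """Two scores may be compared only when both ruler hashes match."""
    return ruler_of(a) == ruler_of(b)


def group_by_ruler(rows: Iterable[Mapping[str, object]]) -> dict[Ruler, list]:
    """Bucket rows so that comparisons stay inside one ruler."""
    groups: dict[Ruler, list] = defaultdict(list)
    for row in rows:
        groups[ruler_of(row)].append(row)
    return dict(groups)


# Hypothetical rows: the older hashes are placeholders, not real ruler versions.
current = {
    "run": "current-example",
    "scoring_config_version": "sc_d4b93982763a",
    "lens_config_version": "lc_d7fcfddd5bf7",
    "story": 4.1,
}
historic = {
    "run": "gs-ledger-deep",
    "scoring_config_version": "sc_older_example",
    "lens_config_version": "lc_older_example",
    "story": 4.8,
}
```

With these rows, `comparable(current, historic)` is false, so the 4.8 and the 4.1 land in separate buckets of `group_by_ruler` and should not be ranked against each other until the historic transcript is re-scored under the current ruler.
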
 CHANGELOG.md  | 20 ++++++++++++++++++--
 qa/SCORING.md | 34 ++++++++++++++++++++++++++++++++++
 2 files changed, 52 insertions(+), 2 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 13742cae..eb0a9db4 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -51,8 +51,24 @@ architecture* so the gameplay work that follows is built on honest, un-gamed sig
 - Story-engagement feedback loop (auto-seeded approval vocabulary, feature-engagement coverage
   scorer, companion-quest orphan cue) (#1017–#1024); acts-engine runtime + felt-shape scorer
   (#1001/#1002); weighted approval + diminishing returns + inter-companion stance (#1003–#1005).
-- Docs current: MODEL-TIERING (the GLM lane + the honest 1-v-1 numbers), SCORING (gate severity as
-  honest measurement, *not* score-gaming; the coercion contract; timing columns), the runbooks (#1031).
+
+### ⚠ Scoring ruler tightened — current scores are NOT comparable to historic numbers
+The felt-world machinery above also **tightened the scoring ruler** to `sc_d4b93982763a` /
+`lc_d7fcfddd5bf7`. The feature-engagement coverage scorer + forcing gate (#1018), the acts
+felt-shape + flat-arc gate (#1001/#1002), betrayal un-inversion (#999), the romance gate (#997), the
+`dm_advanced_time` unmask (#1024), and the gate-severity *accuracy* repair (#1030) now DEMAND that
+companions / quests / acts / betrayal / combat are actually **engaged (gauge-backed), not narrated**.
+**A run therefore scores LOWER under this ruler than under the v1.0.4 rulers — by design: the scorer
+is a deliberately-tightening feedback loop, not a fixed yardstick.** Numbers are fenced by the
+`scoring_config_version` / `lens_config_version` stamped on every `scores_db` row — **never compare a
+current number to a historic one across different `sc_`/`lc_` hashes** (e.g. the historic
+`gs-ledger-deep` story **4.8** was graded by an OLDER, looser ruler, so it is not comparable to a current 4.1).
+See `qa/SCORING.md` §0 for the ruler-version history + how to re-score a historic transcript for an
+apples-to-apples comparison.
+
+- Docs current: MODEL-TIERING (the GLM lane + the honest 1-v-1 numbers), SCORING (the **§0
+  ruler-version history**; gate severity as honest measurement, *not* score-gaming; the coercion
+  contract; timing columns), the runbooks (#1031).
 
 - Licensing: WorldOS Source-Available Commercial EULA v1.0 + `ROYALTY-ADDENDUM.md`,
   `COMMERCIAL-LICENSE.md`, `CLA.md`, `CONTRIBUTING.md`, a PR CLA template. Prior MIT grants for
diff --git a/qa/SCORING.md b/qa/SCORING.md
index 12d23cc9..90bdc536 100644
--- a/qa/SCORING.md
+++ b/qa/SCORING.md
@@ -14,6 +14,40 @@
 The fitness function = **1 hard behavioral gate** (deterministic pass/fail) + **3 LLM
 lenses** (each 1–5). The gate is the honest floor; the lenses grade quality above it.
 
+## 0. The scoring ruler is VERSIONED — scores are NOT comparable across rulers
+
+Every scored run is stamped with the **content hash of the ruler that graded it**:
+`scoring_config_version` (`sc_…`, the FULL ruler — rubrics + schemas + all gates incl. RRI) and
+`lens_config_version` (`lc_…`, the 8 files that produce the lens numbers). Computed by
+`qa/scoring_config_version.py`, written on every `add_run(...)`. **This is load-bearing: a lens number
+means nothing without its ruler.** A 4.1 under a stricter ruler can be *better play* than a 4.8 under a
+looser one. NEVER compare two numbers across different `sc_`/`lc_` hashes — compare within a ruler, or
+re-score the old transcript under the current ruler.
+
+**The ruler is a deliberately-tightening FEEDBACK LOOP, not a fixed yardstick.** As we add
+engine-enforced systems (companions, acts, quests, betrayal, travel, combat coverage), the ruler is
+tightened to DEMAND that they are actually *engaged* (gauge-backed), not merely narrated, so the same felt
+session scores *lower* under a newer ruler than an older one. **That drop is the ruler working, not a
+quality regression.** Expect current numbers to read below historic numbers; that is by design — the
+scorer exists to drive autonomous build-and-improve.
+
+### Ruler history
+- **`sc_d4b93982763a` / `lc_d7fcfddd5bf7` — current (2026-06 cycle → `v1.0.5-rc1`).** Materially
+  STRICTER than the v1.0.4 rulers. Adds: the **feature-engagement coverage scorer + forcing gate**
+  (#1018 — every *owed* authored system narrated-but-not-gauged is now a coverage miss); the
+  **acts-engine felt-shape scorer + flat-arc gate** (#1001/#1002 — a flat, act-less arc is penalized);
+  **betrayal un-inversion** (#999) and the **romance gate** (#997); the **`dm_advanced_time` unmask**
+  (#1024 — a frozen-clock DM no longer hides); and the **gate-severity accuracy repair** (#1030 —
+  removes false FATAL-caps so the floor is TRUE, not loose). A social/short slice that exercises few of
+  these reads markedly lower here than under v1.0.4.
+- **Older `sc_…` rulers (≤ `v1.0.4-rc5`).** Looser: no feature-engagement coverage demand, no acts
+  felt-shape, pre-betrayal-fix. Historic numbers — e.g. the `gs-ledger-deep` story **4.8** full-depth
+  proof — were graded by an OLDER ruler; **do not read them as directly comparable to current numbers.**
+
+To compare a historic run to today honestly, **re-score its transcript under the current ruler**
+(`qa/score.sh <transcript.md> <state.json> <rubric> <schema> <out> [budget]`), then compare `sc_`-equal
+rows only.
+
 ## 1. Behavioral gate — HARD pass/fail (`qa/assert_behavioral.py`)
 LLM scorers grade prose and can't be trusted to flip RED on a structurally broken run,
 so this deterministic gate does. Exit 0 = GREEN (warnings allowed), 1 = RED. FATAL checks: