From 47ad05437499542f97cd54283248fad2fb9a6069 Mon Sep 17 00:00:00 2001
From: Eva
Date: Fri, 19 Jun 2026 23:08:18 +0700
Subject: [PATCH] =?UTF-8?q?docs:=20annotate=20the=20tightened=20scoring=20?=
 =?UTF-8?q?ruler=20(sc=5Fd4b9=E2=80=A6)=20=E2=80=94=20scores=20NOT=20compa?=
 =?UTF-8?q?rable=20across=20rulers?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The 2026-06 cycle materially tightened the scoring ruler (feature-engagement
coverage scorer #1018, acts felt-shape #1001/#1002, betrayal un-inversion #999,
romance gate #997, dm_advanced_time unmask #1024, gate-severity accuracy #1030),
so a run scores LOWER under sc_d4b93982763a/lc_d7fcfddd5bf7 than under the
v1.0.4 rulers — BY DESIGN (the scorer is a tightening feedback loop). Document
the ruler-version mechanism + history in SCORING.md §0 and annotate it in the
v1.0.5-rc1 CHANGELOG, so current numbers are never mis-compared to historic ones
(every scores_db row is fenced by scoring_config_version/lens_config_version).
Stable-checkpoint hygiene per owner.
---
 CHANGELOG.md  | 20 ++++++++++++++++++--
 qa/SCORING.md | 34 ++++++++++++++++++++++++++++++++++
 2 files changed, 52 insertions(+), 2 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 13742cae..eb0a9db4 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -51,8 +51,24 @@ architecture* so the gameplay work that follows is built on honest, un-gamed sig
 - Story-engagement feedback loop (auto-seeded approval vocabulary, feature-engagement coverage
   scorer, companion-quest orphan cue) (#1017–#1024); acts-engine runtime + felt-shape scorer
   (#1001/#1002); weighted approval + diminishing returns + inter-companion stance (#1003–#1005).
-- Docs current: MODEL-TIERING (the GLM lane + the honest 1-v-1 numbers), SCORING (gate severity as
-  honest measurement, *not* score-gaming; the coercion contract; timing columns), the runbooks (#1031).
+
+### ⚠ Scoring ruler tightened — current scores are NOT comparable to historic numbers
+The felt-world machinery above also **tightened the scoring ruler** to `sc_d4b93982763a` /
+`lc_d7fcfddd5bf7`. The feature-engagement coverage scorer + forcing gate (#1018), the acts felt-shape
++ flat-arc gate (#1001/#1002), betrayal un-inversion (#999), the romance gate (#997), the
+`dm_advanced_time` unmask (#1024), and the gate-severity *accuracy* repair (#1030) now DEMAND that
+companions / quests / acts / betrayal / combat are actually **engaged (gauge-backed), not narrated**.
+**A run therefore scores LOWER under this ruler than under the v1.0.4 rulers — by design: the scorer
+is a deliberately-tightening feedback loop, not a fixed yardstick.** Numbers are fenced by the
+`scoring_config_version` / `lens_config_version` stamped on every `scores_db` row — **never compare a
+current number to a historic one across different `sc_`/`lc_` hashes** (e.g. the historic
+`gs-ledger-deep` story **4.8** was an OLDER, looser ruler, not directly comparable to a current 4.1).
+See `qa/SCORING.md` §0 for the ruler-version history + how to re-score a historic transcript for an
+apples-to-apples comparison.
+
+- Docs current: MODEL-TIERING (the GLM lane + the honest 1-v-1 numbers), SCORING (the **§0
+  ruler-version history**; gate severity as honest measurement, *not* score-gaming; the coercion
+  contract; timing columns), the runbooks (#1031).
 
 - Licensing: WorldOS Source-Available Commercial EULA v1.0 + `ROYALTY-ADDENDUM.md`,
   `COMMERCIAL-LICENSE.md`, `CLA.md`, `CONTRIBUTING.md`, a PR CLA template. Prior MIT grants for
diff --git a/qa/SCORING.md b/qa/SCORING.md
index 12d23cc9..90bdc536 100644
--- a/qa/SCORING.md
+++ b/qa/SCORING.md
@@ -14,6 +14,40 @@
 The fitness function = **1 hard behavioral gate** (deterministic pass/fail) + **3 LLM lenses**
 (each 1–5). The gate is the honest floor; the lenses grade quality above it.
 
+## 0. The scoring ruler is VERSIONED — scores are NOT comparable across rulers
+
+Every scored run is stamped with the **content hash of the ruler that graded it**:
+`scoring_config_version` (`sc_…`, the FULL ruler — rubrics + schemas + all gates incl. RRI) and
+`lens_config_version` (`lc_…`, the 8 files that produce the lens numbers). Computed by
+`qa/scoring_config_version.py`, written on every `add_run(...)`. **This is load-bearing: a lens number
+means nothing without its ruler.** A 4.1 under a stricter ruler can be *better play* than a 4.8 under a
+looser one. NEVER compare two numbers across different `sc_`/`lc_` hashes — compare within a ruler, or
+re-score the old transcript under the current ruler.
+
+**The ruler is a deliberately-tightening FEEDBACK LOOP, not a fixed yardstick.** As we add
+engine-enforced systems (companions, acts, quests, betrayal, travel, combat coverage), the ruler is
+tightened to DEMAND those are actually *engaged* (gauge-backed), not merely narrated — so the same felt
+session scores *lower* under a newer ruler than an older one. **That drop is the ruler working, not a
+quality regression.** Expect current numbers to read below historic numbers; that is by design — the
+scorer exists to drive autonomous build-and-improve.
+
+### Ruler history
+- **`sc_d4b93982763a` / `lc_d7fcfddd5bf7` — current (2026-06 cycle → `v1.0.5-rc1`).** Materially
+  STRICTER than the v1.0.4 rulers. Adds: the **feature-engagement coverage scorer + forcing gate**
+  (#1018 — every *owed* authored system narrated-but-not-gauged is now a coverage miss); the
+  **acts-engine felt-shape scorer + flat-arc gate** (#1001/#1002 — a flat, act-less arc is penalized);
+  **betrayal un-inversion** (#999) and the **romance gate** (#997); the **`dm_advanced_time` unmask**
+  (#1024 — a frozen-clock DM no longer hides); and the **gate-severity accuracy repair** (#1030 —
+  removes false FATAL-caps so the floor is TRUE, not loose). A social/short slice that exercises few of
+  these reads markedly lower here than under v1.0.4.
+- **Older `sc_…` rulers (≤ `v1.0.4-rc5`).** Looser: no feature-engagement coverage demand, no acts
+  felt-shape, pre-betrayal-fix. Historic numbers — e.g. the `gs-ledger-deep` story **4.8** full-depth
+  proof — were graded by an OLDER ruler; **do not read them as directly comparable to current numbers.**
+
+To compare a historic run to today honestly, **re-score its transcript under the current ruler**
+(`qa/score.sh [budget]`), then compare `sc_`-equal
+rows only.
+
 ## 1. Behavioral gate — HARD pass/fail (`qa/assert_behavioral.py`)
 LLM scorers grade prose and can't be trusted to flip RED on a structurally broken run, so this
 deterministic gate does. Exit 0 = GREEN (warnings allowed), 1 = RED. FATAL checks: