20 changes: 18 additions & 2 deletions CHANGELOG.md
@@ -51,8 +51,24 @@ architecture* so the gameplay work that follows is built on honest, un-gamed sig
- Story-engagement feedback loop (auto-seeded approval vocabulary, feature-engagement coverage
scorer, companion-quest orphan cue) (#1017–#1024); acts-engine runtime + felt-shape scorer
(#1001/#1002); weighted approval + diminishing returns + inter-companion stance (#1003–#1005).
- Docs current: MODEL-TIERING (the GLM lane + the honest 1-v-1 numbers), SCORING (gate severity as
honest measurement, *not* score-gaming; the coercion contract; timing columns), the runbooks (#1031).

### ⚠ Scoring ruler tightened — current scores are NOT comparable to historic numbers
The felt-world machinery above also **tightened the scoring ruler** to `sc_d4b93982763a` /
`lc_d7fcfddd5bf7`. The feature-engagement coverage scorer + forcing gate (#1018), the acts felt-shape
+ flat-arc gate (#1001/#1002), betrayal un-inversion (#999), the romance gate (#997), the
`dm_advanced_time` unmask (#1024), and the gate-severity *accuracy* repair (#1030) now DEMAND that
companions / quests / acts / betrayal / combat are actually **engaged (gauge-backed), not narrated**.
**A run therefore scores LOWER under this ruler than under the v1.0.4 rulers — by design: the scorer
is a deliberately-tightening feedback loop, not a fixed yardstick.** Numbers are fenced by the
`scoring_config_version` / `lens_config_version` stamped on every `scores_db` row — **never compare a
current number to a historic one across different `sc_`/`lc_` hashes** (e.g. the historic
`gs-ledger-deep` story **4.8** was an OLDER, looser ruler, not directly comparable to a current 4.1).
See `qa/SCORING.md` §0 for the ruler-version history + how to re-score a historic transcript for an
apples-to-apples comparison.

- Docs current: MODEL-TIERING (the GLM lane + the honest 1-v-1 numbers), SCORING (the **§0
ruler-version history**; gate severity as honest measurement, *not* score-gaming; the coercion
contract; timing columns), the runbooks (#1031).

- Licensing: WorldOS Source-Available Commercial EULA v1.0 + `ROYALTY-ADDENDUM.md`,
`COMMERCIAL-LICENSE.md`, `CLA.md`, `CONTRIBUTING.md`, a PR CLA template. Prior MIT grants for
34 changes: 34 additions & 0 deletions qa/SCORING.md
@@ -14,6 +14,40 @@
The fitness function = **1 hard behavioral gate** (deterministic pass/fail) + **3 LLM
lenses** (each 1–5). The gate is the honest floor; the lenses grade quality above it.

## 0. The scoring ruler is VERSIONED — scores are NOT comparable across rulers

Every scored run is stamped with the **content hash of the ruler that graded it**:
`scoring_config_version` (`sc_…`, the FULL ruler — rubrics + schemas + all gates incl. RRI) and
`lens_config_version` (`lc_…`, the 8 files that produce the lens numbers). Computed by
`qa/scoring_config_version.py`, written on every `add_run(...)`. **This is load-bearing: a lens number
means nothing without its ruler.** A 4.1 under a stricter ruler can be *better play* than a 4.8 under a
looser one. NEVER compare two numbers across different `sc_`/`lc_` hashes — compare within a ruler, or
re-score the old transcript under the current ruler.
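
As a rough illustration of why a content hash fences scores to their ruler, here is a minimal sketch. It is not the code in `qa/scoring_config_version.py`, which is not shown here; the function name `ruler_hash`, the use of SHA-256, the 12-character truncation, and the sample file names are all assumptions made for the example.

```python
import hashlib


def ruler_hash(prefix: str, files: dict[str, bytes], length: int = 12) -> str:
    """Hypothetical ruler stamp: hash file names and contents in sorted order.

    Sorting makes the stamp independent of the order the files are listed in,
    so only a change to a name or to file content produces a new stamp.
    """
    digest = hashlib.sha256()
    for name in sorted(files):
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")  # separator so name/content boundaries stay unambiguous
        digest.update(files[name])
        digest.update(b"\0")
    return f"{prefix}_{digest.hexdigest()[:length]}"


# Two illustrative rulers that differ by a single added gate check.
loose_ruler = {"rubric.md": b"grade the prose", "gate.py": b"checks = []"}
strict_ruler = {"rubric.md": b"grade the prose", "gate.py": b"checks = ['coverage']"}

loose_stamp = ruler_hash("sc", loose_ruler)
strict_stamp = ruler_hash("sc", strict_ruler)
print(loose_stamp != strict_stamp)  # any content change yields a different stamp
```

Under this scheme, editing any rubric, schema, or gate file changes the stamp, which is what lets a stored score be traced back to the exact ruler that produced it.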

**The ruler is a deliberately-tightening FEEDBACK LOOP, not a fixed yardstick.** As we add
engine-enforced systems (companions, acts, quests, betrayal, travel, combat coverage), the ruler is
tightened to DEMAND those are actually *engaged* (gauge-backed), not merely narrated — so the same felt
session scores *lower* under a newer ruler than an older one. **That drop is the ruler working, not a
quality regression.** Expect current numbers to read below historic numbers; that is by design — the
scorer exists to drive autonomous build-and-improve.

### Ruler history
- **`sc_d4b93982763a` / `lc_d7fcfddd5bf7` — current (2026-06 cycle → `v1.0.5-rc1`).** Materially
STRICTER than the v1.0.4 rulers. Adds: the **feature-engagement coverage scorer + forcing gate**
(#1018 — every *owed* authored system narrated-but-not-gauged is now a coverage miss); the
**acts-engine felt-shape scorer + flat-arc gate** (#1001/#1002 — a flat, act-less arc is penalized);
**betrayal un-inversion** (#999) and the **romance gate** (#997); the **`dm_advanced_time` unmask**
(#1024 — a frozen-clock DM no longer hides); and the **gate-severity accuracy repair** (#1030 —
removes false FATAL-caps so the floor is TRUE, not loose). A social/short slice that exercises few of
these reads markedly lower here than under v1.0.4.
- **Older `sc_…` rulers (≤ `v1.0.4-rc5`).** Looser: no feature-engagement coverage demand, no acts
felt-shape, pre-betrayal-fix. Historic numbers — e.g. the `gs-ledger-deep` story **4.8** full-depth
proof — were graded by an OLDER ruler; **do not read them as directly comparable to current numbers.**

To compare a historic run to today honestly, **re-score its transcript under the current ruler**
(`qa/score.sh <transcript.md> <state.json> <rubric> <schema> <out> [budget]`), then compare `sc_`-equal
rows only.
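
The "compare `sc_`-equal rows only" rule can be enforced in code rather than by convention. The sketch below is an assumption-laden illustration: `ScoreRow`, `same_ruler`, and `story_delta` are hypothetical names, the real `scores_db` schema is not shown here, and the run ids, the all-zero hashes, and the scores are placeholders, not real results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreRow:
    """Hypothetical view of one scored run; only the two version fields come from the docs."""
    run_id: str
    story: float
    scoring_config_version: str
    lens_config_version: str


def same_ruler(a: ScoreRow, b: ScoreRow) -> bool:
    """True only when both rows were graded by the identical ruler."""
    return (
        a.scoring_config_version == b.scoring_config_version
        and a.lens_config_version == b.lens_config_version
    )


def story_delta(old: ScoreRow, new: ScoreRow) -> float:
    """Return new.story - old.story, refusing to compare across rulers."""
    if not same_ruler(old, new):
        raise ValueError(
            f"rulers differ ({old.scoring_config_version} vs "
            f"{new.scoring_config_version}); re-score the older transcript first"
        )
    return new.story - old.story


# Placeholder rows: the all-zero hashes stand in for an older ruler.
historic = ScoreRow("run-a", 4.5, "sc_000000000000", "lc_000000000000")
rescored = ScoreRow("run-a-rescored", 3.5, "sc_d4b93982763a", "lc_d7fcfddd5bf7")
current = ScoreRow("run-b", 4.0, "sc_d4b93982763a", "lc_d7fcfddd5bf7")

print(story_delta(rescored, current))  # allowed: both rows share one ruler
```

Calling `story_delta(historic, current)` raises `ValueError`, so a cross-ruler comparison fails loudly instead of silently producing a misleading number.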

## 1. Behavioral gate — HARD pass/fail (`qa/assert_behavioral.py`)
LLM scorers grade prose and can't be trusted to flip RED on a structurally broken run,
so this deterministic gate does. Exit 0 = GREEN (warnings allowed), 1 = RED. FATAL checks: