docs: annotate the tightened scoring ruler (scores NOT comparable across rulers) #1034
…mparable across rulers The 2026-06 cycle materially tightened the scoring ruler (feature-engagement coverage scorer #1018, acts felt-shape #1001/#1002, betrayal un-inversion #999, romance gate #997, dm_advanced_time unmask #1024, gate-severity accuracy #1030), so a run scores LOWER under sc_d4b93982763a/lc_d7fcfddd5bf7 than under the v1.0.4 rulers — BY DESIGN (the scorer is a tightening feedback loop). Document the ruler-version mechanism + history in SCORING.md §0 and annotate it in the v1.0.5-rc1 CHANGELOG, so current numbers are never mis-compared to historic ones (every scores_db row is fenced by scoring_config_version/lens_config_version). Stable-checkpoint hygiene per owner.
… annotation (#1034) (#1035) Checkpoint marking the Guiding Bolt SRD duration fix (found by running the combat-sprint, proven RED->GREEN, adversarially reviewed) + the ruler-version annotation. Still NOT a GA — mech remains below the 4.5 bar; story BG-caliber + satisfaction green. Co-authored-by: Eva <arncalso@gmail.com>
Owner-flagged stable-checkpoint hygiene: current scores read lower than historic because the scoring ruler got more rigorous this cycle, not because quality regressed — and that wasn't annotated anywhere human-readable.
What this adds
- `qa/SCORING.md` §0, "The scoring ruler is VERSIONED": every run is fenced by `scoring_config_version` (`sc_…`) + `lens_config_version` (`lc_…`); a number is meaningless without its ruler; the ruler is a deliberately-tightening feedback loop, so newer-ruler numbers read lower by design. Includes a ruler history (current `sc_d4b93982763a` vs the looser ≤v1.0.4 rulers) + how to re-score a historic transcript for an apples-to-apples comparison.
- CHANGELOG `[1.0.5-rc1]`: an explicit "⚠ Scoring ruler tightened" note tying the current ruler hash to the features that tightened it, and warning against cross-ruler comparison (e.g. the historic `gs-ledger-deep` 4.8 was scored under an older ruler). The tightening PRs:
  - #1018 feat(qa): feature-engagement feedback loop - manifest + coverage scorer + forcing gate (WS0, all-WARN)
  - #1001 feat(engine): acts-engine - runtime act cursor + act-transition cues + advance tools (Phase B 1-3)
  - #1002 feat(qa): acts-engine felt-shape scorer + flat-arc WARN gate (Phase B 4-5)
  - #999 fix(audit): un-invert betrayal - gates, agenda values, telegraph anchor
  - #997 content: ashfall-reach - the romance golden spine (exercises the romance gate)
  - #1024 feat(qa): dm_advanced_time WARN guard - unmask a frozen DM the soft-tick hid (WS-E)
  - #1030 fix(qa): gate-severity audit - stop FATAL behavioral gates false-capping short emergent duos

Why it matters
We're beyond a normal release window; stable, differentiable checkpoints are critical. The `sc_`/`lc_` stamping already exists in `scores_db`; this makes the human framing match, so past/future runs and versions stay distinguishable and the scorer can serve as the autonomous build-and-improve feedback loop it's designed to be.

Docs-only.

🤖 Generated with Claude Code
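The "never compare across rulers" rule can be sketched in a few lines of Python. This is an illustration only, not the project's code: the `ScoreRow` shape, the helper names, the run IDs, the `*_older_example` version strings, and every score value except the historic 4.8 are assumptions made for the example. Only the two field names and the current `sc_d4b93982763a`/`lc_d7fcfddd5bf7` pair come from the PR text.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreRow:
    # Hypothetical row shape; the real scores_db schema is not shown in this PR.
    run_id: str
    score: float
    scoring_config_version: str
    lens_config_version: str


def ruler_of(row):
    """A ruler is identified by the (sc_..., lc_...) pair stamped on the row."""
    return (row.scoring_config_version, row.lens_config_version)


def group_by_ruler(rows):
    """Bucket rows so that each bucket only holds mutually comparable scores."""
    groups = defaultdict(list)
    for row in rows:
        groups[ruler_of(row)].append(row)
    return dict(groups)


def score_delta(before, after):
    """Return after.score - before.score, refusing cross-ruler comparisons."""
    if ruler_of(before) != ruler_of(after):
        raise ValueError(
            f"not comparable: {ruler_of(before)} vs {ruler_of(after)}; "
            "re-score the historic transcript under the current ruler first"
        )
    return after.score - before.score


# Illustrative data: only the 4.8 and the current sc_/lc_ pair come from the PR.
EXAMPLE_ROWS = [
    ScoreRow("gs-ledger-deep", 4.8, "sc_older_example", "lc_older_example"),
    ScoreRow("run-a", 4.0, "sc_d4b93982763a", "lc_d7fcfddd5bf7"),
    ScoreRow("run-b", 4.5, "sc_d4b93982763a", "lc_d7fcfddd5bf7"),
]
```

Under these assumptions, `score_delta(EXAMPLE_ROWS[1], EXAMPLE_ROWS[2])` is a legitimate same-ruler comparison, while `score_delta(EXAMPLE_ROWS[0], EXAMPLE_ROWS[1])` raises rather than reporting a misleading "regression" from 4.8.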