Test262 interp↔compiled parity: 4,511 of 11,384 tests diverge (differential oracle at scale) #916

Diffing the two committed Test262 baselines (`baselines/interpreted.txt` vs `baselines/compiled.txt` — both regenerated **together** at `55f8241f` / #882, 2026-06-22, so they are aligned to one commit) shows the interpreter and compiler **disagree on 4,511 of 11,384 tests (~40%)**. This is the curated differential-parity harness (#910/#911) idea applied at scale: every divergence is a spot where the two execution modes produce different ECMA-262 outcomes, so at least one is wrong.

## Divergence histogram (interp bucket → compiled bucket)

| count | interp → compiled | meaning |
|---|---|---|
| 2326 | Fail → Pass | interp assertion fails, compiled passes |
| 1250 | RuntimeError → Pass | interp throws (non-Test262Error), compiled passes |
| 681 | RuntimeError → Fail | both fail, in different ways |
| 118 | Fail → RuntimeError | both fail, in different ways |
| **85** | **Pass → Fail** | **interp passes, compiled assertion fails — compiled bug** |
| **29** | **Pass → RuntimeError** | **interp passes, compiled throws — compiled bug** |
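| 18 | Fail → HarnessError | interp assertion fails, compiled hits a harness error |
| 3 | Timeout → Pass | interp times out, compiled passes |
| 1 | Fail → Timeout | interp assertion fails, compiled times out |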

## Two actionable subsets

### A. 114 compiled regressions (interp Pass, compiled Fail/RuntimeError) — clearest compiler bugs
In these tests the interpreter passes, so it serves as the reference and the compiled result is wrong. Clustered by category:

| count | category |
|---|---|
| 28 | built-ins/RegExp |
| 28 | built-ins/Object |
| 20 | built-ins/Array |
| 19 | built-ins/Promise |
| 14 | language/expressions |
| 4 | built-ins/String |
| 1 | built-ins/JSON |

Example Array cluster: `Array/prototype/with/index-bigger-or-eq-than-length` + `…/index-smaller-than-minus-length` (out-of-bounds RangeError), `includes/samevaluezero` (NaN / SameValueZero), `includes/this-is-not-object`, `reduce/15.4.4.21-9-*` (abrupt completions), `flat/bound-function-call`, `copyWithin|fill/return-abrupt-from-this`. The per-category clustering suggests shared root causes rather than 114 independent bugs.
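
The category table can be recomputed from the committed baselines. The following is a minimal sketch, not part of the runner. It assumes each non-comment baseline line is `<test-path> <bucket>` (which is what the one-liner under Reproduce relies on) and that test paths begin with the category, for example `built-ins/Array/...`. The function name `regressions_by_category` is hypothetical.

```bash
# Sketch: count compiled regressions (interp Pass, compiled Fail/RuntimeError)
# per category, where the category is the first two path components.
# Assumption: baseline lines look like "<test-path> <bucket>".
regressions_by_category() {
  local compiled="$1" interpreted="$2"
  awk '
    /^#/ || NF < 2 { next }              # skip comments and blank lines
    NR == FNR { c[$1] = $2; next }       # first file: compiled bucket per test
    $2 == "Pass" && ($1 in c) && (c[$1] == "Fail" || c[$1] == "RuntimeError") {
      n = split($1, parts, "/")
      cat = (n >= 2) ? parts[1] "/" parts[2] : parts[1]
      count[cat]++
    }
    END { for (k in count) printf "%6d  %s\n", count[k], k }
  ' "$compiled" "$interpreted" | sort -k1,1nr -k2,2
}
```

Called as `regressions_by_category compiled.txt interpreted.txt` from the baselines directory, it prints one `count  category` line per cluster, largest first. If the path layout differs from the assumption (for instance a leading `test/` component), the split depth needs adjusting.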

### B. 3,576 interp-worse divergences (compiled Pass, interp Fail/RuntimeError) — surprising
The **compiled** path is *more* spec-compliant than the interpreter on a large slice of the corpus, notably the 1,250 `RuntimeError → Pass` cases where the interpreter throws on something the compiled path handles. This inverts the usual "interpreter = reference" assumption. A few systematic interpreter harness or feature gaps likely account for many of these, so root-causing the largest interp `RuntimeError` clusters could close a large fraction at once.
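
One way to find the largest interp `RuntimeError` clusters is to rank directories by how many `RuntimeError → Pass` tests they contain. This is a rough sketch under the same assumed baseline format (`<test-path> <bucket>` per non-comment line). The function name `interp_runtime_error_clusters` and the `depth` parameter are hypothetical, introduced only for illustration.

```bash
# Sketch: rank directory prefixes by the number of tests where the
# interpreter reports RuntimeError and the compiled path reports Pass.
# depth = how many leading path components form the cluster key (default 3).
interp_runtime_error_clusters() {
  local compiled="$1" interpreted="$2" depth="${3:-3}"
  awk -v depth="$depth" '
    /^#/ || NF < 2 { next }              # skip comments and blank lines
    NR == FNR { c[$1] = $2; next }       # first file: compiled bucket per test
    $2 == "RuntimeError" && ($1 in c) && c[$1] == "Pass" {
      n = split($1, parts, "/")
      last = (n - 1 < depth) ? n - 1 : depth   # stop before the file name when there is one
      if (last < 1) last = 1
      key = parts[1]
      for (i = 2; i <= last; i++) key = key "/" parts[i]
      hits[key]++
    }
    END { for (k in hits) printf "%6d  %s\n", hits[k], k }
  ' "$compiled" "$interpreted" | sort -k1,1nr -k2,2
}
```

A deeper `depth` narrows each cluster toward a single built-in method, while a shallower one shows which top-level areas dominate. The directory grouping is only a proxy for a shared root cause, so the top clusters still need manual inspection.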

## Reproduce (no re-run — diffs the committed baselines)
```bash
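cd SharpTS.Test262/baselines
# Pass 1 (compiled.txt): remember each test's compiled bucket, skipping '#' comment lines.
# Pass 2 (interpreted.txt): tally "interp -> compiled" for tests whose buckets differ.
awk 'NR==FNR{if($1!~/^#/)c[$1]=$2;next} /^#/{next}
  {if(($1 in c)&&c[$1]!=$2)t[$2"  ->  "c[$1]]++}
  END{for(k in t)printf "%6d  %s\n",t[k],k}' compiled.txt interpreted.txt | sort -rn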
```
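
The headline figure (diverging tests out of the total, and the resulting percentage) can be checked the same way. The sketch below assumes the same `<test-path> <bucket>` line format and counts only tests present in both files. Whether the 11,384 total refers to that shared set is an assumption here, and the function name `divergence_summary` is hypothetical.

```bash
# Sketch: print "<diverging> <shared> <percent>" for two baseline files.
# "shared" counts tests that appear in both files; "diverging" counts the
# shared tests whose buckets differ.
divergence_summary() {
  local compiled="$1" interpreted="$2"
  awk '
    /^#/ || NF < 2 { next }              # skip comments and blank lines
    NR == FNR { c[$1] = $2; next }       # first file: compiled bucket per test
    ($1 in c) {
      shared++
      if (c[$1] != $2) diverging++
    }
    END {
      pct = shared ? 100 * diverging / shared : 0
      printf "%d %d %.1f\n", diverging, shared, pct
    }
  ' "$compiled" "$interpreted"
}
```

Run as `divergence_summary compiled.txt interpreted.txt`. For reference, 4,511 of 11,384 is 39.6%, which is the "~40%" quoted at the top.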

## Recommendations
1. **Make interp↔compiled divergence a standing metric** — add a differential-report mode to the Test262 runner (per-mode data already exists) so this is tracked, not discovered ad hoc. This is the Test262 analogue of the curated `DifferentialParityTests` gate.
2. **Triage subset A** (114 compiled regressions) — unambiguous compiler bugs; the RegExp/Object/Array/Promise clusters likely share root causes.
3. **Investigate subset B's asymmetry** — find the systematic interp gap(s) behind the 3,576 (start with the largest interp-`RuntimeError` clusters).
4. **Caveat:** counts are as of baseline `55f8241f` (2026-06-22). Test262 isn't in CI yet (#71), so the baselines can drift; a fresh regeneration of both baselines at one commit would confirm the current numbers.
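
Until the runner has a differential-report mode, recommendation 1 could be approximated with a small check over the committed baselines. This is only a sketch of the idea, not the runner's behaviour: the function name `check_compiled_regressions` and the ceiling argument are hypothetical, and the baseline line format (`<test-path> <bucket>`) is assumed from the Reproduce one-liner.

```bash
# Sketch: succeed only if the number of compiled regressions
# (interp Pass, compiled Fail/RuntimeError) is at or below a ceiling.
check_compiled_regressions() {
  local compiled="$1" interpreted="$2" max_allowed="$3"
  local found
  found=$(awk '
    /^#/ || NF < 2 { next }              # skip comments and blank lines
    NR == FNR { c[$1] = $2; next }       # first file: compiled bucket per test
    $2 == "Pass" && ($1 in c) && (c[$1] == "Fail" || c[$1] == "RuntimeError") { n++ }
    END { print n + 0 }                  # n + 0 prints 0 when nothing matched
  ' "$compiled" "$interpreted")
  echo "compiled regressions: $found (allowed: $max_allowed)"
  [ "$found" -le "$max_allowed" ]        # return status reflects the comparison
}
```

For example, `check_compiled_regressions compiled.txt interpreted.txt 114` would hold the line at the current subset A count and return non-zero if a regenerated baseline pair adds new regressions. Lowering the ceiling as bugs are fixed turns it into a ratchet.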

_Found by extending the differential-parity-harness idea (#910/#911) from the curated corpus to the full Test262 corpus._

