Test262 interp↔compiled parity: 4,511 of 11,384 tests diverge (differential oracle at scale) #916

Description

@nickna

Diffing the two committed Test262 baselines (baselines/interpreted.txt vs baselines/compiled.txt) shows that the interpreter and the compiler disagree on 4,511 of 11,384 tests (~40%). Both baselines were regenerated together at 55f8241f / #882 on 2026-06-22, so they are aligned to one commit. This applies the curated differential-parity harness idea (#910/#911) at scale: every divergence is a spot where the two execution modes produce different ECMA-262 outcomes, so at least one of them is wrong.

Divergence histogram (interp bucket → compiled bucket)

| count | interp → compiled | meaning |
|---:|---|---|
| 2326 | Fail → Pass | interp assertion fails, compiled passes |
| 1250 | RuntimeError → Pass | interp throws (non-Test262Error), compiled passes |
| 681 | RuntimeError → Fail | both fail, in different ways |
| 118 | Fail → RuntimeError | both fail, in different ways |
| 85 | Pass → Fail | interp passes, compiled assertion fails (compiled bug) |
| 29 | Pass → RuntimeError | interp passes, compiled throws (compiled bug) |
| 18 | Fail → HarnessError | |
| 3 | Timeout → Pass | |
| 1 | Fail → Timeout | |

Two actionable subsets

A. 114 compiled regressions (interp Pass, compiled Fail/RuntimeError) — clearest compiler bugs

Where the interpreter is the correct reference and the compiler is wrong. Clustered by category:

| count | category |
|---:|---|
| 28 | built-ins/RegExp |
| 28 | built-ins/Object |
| 20 | built-ins/Array |
| 19 | built-ins/Promise |
| 14 | language/expressions |
| 4 | built-ins/String |
| 1 | built-ins/JSON |

Example Array cluster: Array/prototype/with/index-bigger-or-eq-than-length + …/index-smaller-than-minus-length (out-of-bounds RangeError), includes/samevaluezero (NaN / SameValueZero), includes/this-is-not-object, reduce/15.4.4.21-9-* (abrupt completions), flat/bound-function-call, copyWithin|fill/return-abrupt-from-this. The per-category clustering suggests shared root causes rather than 114 independent bugs.
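The per-category counts for subset A can be derived from the same two baseline files. The following is a minimal sketch, not the runner's actual tooling: it assumes each baseline line is a whitespace-separated `<test-path> <bucket>` pair with `#` comment lines (which is what the reproduce command in this issue implies), and that the category is the first two path segments. The function names `compiled_regressions` and `cluster_by_category` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only. Assumed baseline line format: "<test-path> <bucket>",
# with "#" comment lines. Function names are hypothetical.

compiled_regressions() {
  # $1 = interpreted baseline, $2 = compiled baseline.
  # Prints every test that the interpreter passes and the compiled
  # path fails (Fail or RuntimeError).
  awk 'NR==FNR { if ($1 !~ /^#/) interp[$1] = $2; next }
       /^#/    { next }
       ($1 in interp) && interp[$1] == "Pass" &&
         ($2 == "Fail" || $2 == "RuntimeError") { print $1 }' "$1" "$2"
}

cluster_by_category() {
  # Reads test paths on stdin and prints "<count>  <category>",
  # where the category is the first two path segments.
  awk -F/ 'NF >= 2 { n[$1 "/" $2]++ }
           END { for (k in n) printf "%6d  %s\n", n[k], k }' | sort -rn
}
```

Usage, from the baselines directory: `compiled_regressions interpreted.txt compiled.txt | cluster_by_category`. Dropping the final pipe stage gives the raw list of regressed tests for triage.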

B. 3,576 interp-worse divergences (compiled Pass, interp Fail/RuntimeError) — surprising

The compiled path is more spec-compliant than the interpreter on a large slice of the corpus, notably the 1,250 RuntimeError → Pass cases, where the interpreter throws on something the compiled path handles. This inverts the usual "interpreter = reference" assumption. A few systematic interpreter harness or feature gaps likely account for many of these, so root-causing the largest interp RuntimeError clusters could close a large fraction at once.

Reproduce (no re-run — diffs the committed baselines)

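cd SharpTS.Test262/baselines
# Pass 1 reads compiled.txt into c[test] = bucket; pass 2 reads interpreted.txt
# and counts each "interp -> compiled" pair whose buckets differ.
awk 'NR==FNR{if($1!~/^#/)c[$1]=$2;next} /^#/{next}
  {if(($1 in c)&&c[$1]!=$2)t[$2"  ->  "c[$1]]++}
  END{for(k in t)printf "%6d  %s\n",t[k],k}' compiled.txt interpreted.txt | sort -rn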

Recommendations

  1. Make interp↔compiled divergence a standing metric — add a differential-report mode to the Test262 runner (per-mode data already exists) so this is tracked, not discovered ad hoc. This is the Test262 analogue of the curated DifferentialParityTests gate.
  2. Triage subset A (114 compiled regressions) — unambiguous compiler bugs; the RegExp/Object/Array/Promise clusters likely share root causes.
  3. Investigate subset B's asymmetry — find the systematic interp gap(s) behind the 3,576 (start with the largest interp-RuntimeError clusters).
  4. Caveat: counts are as of baseline 55f8241f (2026-06-22). Test262 isn't in CI yet (Test262: CI integration #71), so the baselines can drift; regenerating both baselines at the same commit would confirm the current numbers.
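As a rough illustration of recommendation 1, a standing metric could start as a small script over the committed baselines, before any runner changes. This is a sketch under assumptions, not a proposed implementation: it assumes the same `<test-path> <bucket>` line format with `#` comments that the reproduce command relies on, and the function names and the limit argument are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only. Function names and the limit are hypothetical;
# the baseline line format "<test-path> <bucket>" is assumed.

divergence_count() {
  # $1 = interpreted baseline, $2 = compiled baseline.
  # Prints how many tests present in both files have different buckets.
  awk 'NR==FNR { if ($1 !~ /^#/) interp[$1] = $2; next }
       /^#/    { next }
       ($1 in interp) && interp[$1] != $2 { n++ }
       END { print n + 0 }' "$1" "$2"
}

check_divergence() {
  # $3 = maximum allowed divergences.
  # Prints the metric and returns non-zero when the limit is exceeded.
  count=$(divergence_count "$1" "$2")
  echo "interp/compiled divergences: $count (limit $3)"
  [ "$count" -le "$3" ]
}
```

For example, `check_divergence interpreted.txt compiled.txt 4511` would pass as long as the count does not grow beyond the figure reported here. Lowering the limit as bugs are fixed turns the number into a ratchet rather than a one-off observation.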

Found by extending the differential-parity-harness idea (#910/#911) from the curated corpus to the full Test262 corpus.
