Diffing the two committed Test262 baselines (baselines/interpreted.txt vs baselines/compiled.txt — both regenerated together at 55f8241f / #882, 2026-06-22, so they are aligned to one commit) shows the interpreter and compiler disagree on 4,511 of 11,384 tests (~40%). This is the curated differential-parity harness (#910/#911) idea applied at scale: every divergence is a spot where the two execution modes produce different ECMA-262 outcomes, so at least one is wrong.
Two actionable subsets
A. 114 compiled regressions (interp Pass, compiled Fail/RuntimeError): clearest compiler bugs
Where the interpreter is the correct reference and the compiler is wrong. Clustered by category:
| count | category |
|---|---|
| 28 | built-ins/RegExp |
| 28 | built-ins/Object |
| 20 | built-ins/Array |
| 19 | built-ins/Promise |
| 14 | language/expressions |
| 4 | built-ins/String |
| 1 | built-ins/JSON |
Example Array cluster: Array/prototype/with/index-bigger-or-eq-than-length + …/index-smaller-than-minus-length (out-of-bounds RangeError), includes/samevaluezero (NaN / SameValueZero), includes/this-is-not-object, reduce/15.4.4.21-9-* (abrupt completions), flat/bound-function-call, copyWithin|fill/return-abrupt-from-this. The per-category clustering suggests shared root causes rather than 114 independent bugs.
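The per-category counts above could be regenerated from the committed baselines with something like the following. This is a minimal sketch, not the project's tooling: the function name `compiled_regressions_by_category` is hypothetical, and it assumes each non-comment baseline line is `<test-path> <status>` and that test paths begin with `<area>/<category>/` (for example `built-ins/Array/...`).

```sh
# Hypothetical helper: count subset A (interp Pass, compiled Fail/RuntimeError)
# per category. Assumes "<test-path> <status>" lines and "#" comment lines.
compiled_regressions_by_category() {
  # $1 = compiled baseline, $2 = interpreted baseline
  awk '
    NR==FNR { if ($1 !~ /^#/) c[$1] = $2; next }   # first file: compiled status per test
    /^#/    { next }                                # skip comments in the second file
    ($1 in c) && $2 == "Pass" && (c[$1] == "Fail" || c[$1] == "RuntimeError") {
      n = split($1, p, "/")
      key = (n >= 2) ? p[1] "/" p[2] : $1           # e.g. built-ins/Array
      t[key]++
    }
    END { for (k in t) printf "%6d %s\n", t[k], k }
  ' "$1" "$2" | sort -rn
}
```

From `SharpTS.Test262/baselines` it would be called as `compiled_regressions_by_category compiled.txt interpreted.txt`. If the baseline paths carry an extra leading component, the `split` indices would need adjusting.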
B. 3,576 interp-worse divergences (compiled Pass, interp Fail/RuntimeError): surprising
The compiled path is more spec-compliant than the interpreter on a large slice of the suite, notably 1,250 outright RuntimeError → Pass cases where the interpreter throws on something the compiled path handles. This inverts the usual "interpreter = reference" assumption. A few systematic interpreter harness or feature gaps likely account for many of these, so root-causing the biggest interp RuntimeError clusters could close a large fraction at once.
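One way to find the biggest interp RuntimeError clusters is to group the RuntimeError → Pass tests by their containing directory. The sketch below is illustrative only: `top_interp_runtime_error_clusters` is a hypothetical name, and the `<test-path> <status>` line format is an assumption inferred from the reproduce command in this issue.

```sh
# Hypothetical helper: rank directories by the number of tests that are
# RuntimeError in the interpreter but Pass when compiled.
top_interp_runtime_error_clusters() {
  # $1 = compiled baseline, $2 = interpreted baseline, $3 = clusters to show (default 20)
  awk '
    NR==FNR { if ($1 !~ /^#/) c[$1] = $2; next }   # first file: compiled status per test
    /^#/    { next }                                # skip comments in the second file
    ($1 in c) && $2 == "RuntimeError" && c[$1] == "Pass" {
      dir = $1
      sub("/[^/]*$", "", dir)                       # drop the file name, keep the directory
      t[dir]++
    }
    END { for (k in t) printf "%6d %s\n", t[k], k }
  ' "$1" "$2" | sort -rn | head -n "${3:-20}"
}
```

Grouping by directory is a design choice: it is finer than the category table, so a single missing interpreter feature that breaks one built-in method should surface as one large row.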
Reproduce (no re-run — diffs the committed baselines)
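```sh
cd SharpTS.Test262/baselines
# Histogram of "interp -> compiled" status pairs for tests whose statuses differ.
awk '
  NR==FNR { if ($1 !~ /^#/) c[$1] = $2; next }   # first file: compiled status per test
  /^#/    { next }                                # skip comments in the second file
  ($1 in c) && c[$1] != $2 { t[$2 " -> " c[$1]]++ }
  END     { for (k in t) printf "%6d %s\n", t[k], k }
' compiled.txt interpreted.txt | sort -rn
```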
Recommendations
Make interp↔compiled divergence a standing metric — add a differential-report mode to the Test262 runner (per-mode data already exists) so this is tracked, not discovered ad hoc. This is the Test262 analogue of the curated DifferentialParityTests gate.
Triage subset A (114 compiled regressions) — unambiguous compiler bugs; the RegExp/Object/Array/Promise clusters likely share root causes.
Investigate subset B's asymmetry — find the systematic interp gap(s) behind the 3,576 (start with the largest interp-RuntimeError clusters).
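Until the runner has a differential-report mode, the standing metric from the first recommendation could be approximated from the baselines alone. The following is a sketch under stated assumptions: `divergence_summary` is a hypothetical helper, not an existing runner feature, and it assumes `<test-path> <status>` lines and counts only tests present in both baselines.

```sh
# Hypothetical helper: one-line divergence summary suitable for logging or trending.
divergence_summary() {
  # $1 = compiled baseline, $2 = interpreted baseline
  awk '
    NR==FNR { if ($1 !~ /^#/) c[$1] = $2; next }   # first file: compiled status per test
    /^#/    { next }                                # skip comments in the second file
    ($1 in c) { total++; if (c[$1] != $2) diff++ } # compare only tests in both files
    END {
      pct = total ? 100 * diff / total : 0
      printf "compared=%d divergent=%d pct=%.1f\n", total, diff, pct
    }
  ' "$1" "$2"
}
```

For reference, the figures quoted in this issue (4,511 of 11,384) work out to about 39.6%. Whether a fresh run prints exactly those numbers depends on both baselines listing the same tests.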
Caveat: counts are as of baseline 55f8241f (2026-06-22); Test262 isn't in CI yet (Test262: CI integration #71) so baselines can drift — a fresh aligned regen confirms current numbers.
Found by extending the differential-parity-harness idea (#910/#911) from the curated corpus to the full Test262 corpus.