[codex] Enable MBTB VC16 and approximate dense-block interflush #826
jensen-yan wants to merge 5 commits into
Change-Id: I1a44256d41beabc0e61e55240dcea9d6f5163eb8
Change-Id: I1c21db003e08b1cb10ed30498fff4490e3bd70bc
🚀 Coremark Smoke Test Results
✅ Difftest smoke test passed!
Change-Id: Iece153fabf84e766595317dd2038dfd59abc845d
🚀 Coremark Smoke Test Results
✅ Difftest smoke test passed!
[Generated by GEM5 Performance Robot] Align BTB Performance
Overall Score
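Based on the performance comment on this PR, I ran another check with the local
The conclusion is consistent with the local micro-run:
By benchmark, the regressions are concentrated mainly in:
There are also offsetting items:
Looking at the weighted counters,
This looks more like the internal interflush penalty of dense BTB blocks taking effect than a large degradation in predictor correctness.
So overall, this batch of data supports a fairly robust conclusion: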
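Some additional notes on the motivation for this experiment and the current conclusions, mainly to make the RTL-side timing background clear.
1. The question this experiment is meant to answer
The core question on the RTL side right now is not "does the VC provide a benefit", but rather:
When too many branch entries are visible within one fetch block, especially with the VC enabled, the number of branch positions that must be processed in a single cycle can exceed what the RTL can currently close timing on comfortably. For
So this PR is really answering two questions:
2. How this PR approximates the problem in gem5
This PR does not reproduce the RTL pipeline exactly in gem5; instead it provides two experimental handles:
In other words, this approximation is not the final design; it is only used to estimate:
3. Current CI results
The PR comment already has the overall score:
This result and the local micro-run /
By benchmark,
There are also offsetting items:
This shows that
4. Additional local validation: has VC16 essentially solved the problem of insufficient ways?
I also ran locally with the same
Results:
In other words:
Looking at the hot branches, this conclusion also holds:
So a more accurate statement at this point is:
5. My current assessment of this direction
The current data supports a fairly robust conclusion: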
Change-Id: Ie7832f1a5caf99aaaed12b15c0764c8bbbaf2964
🚀 Coremark Smoke Test Results
✅ Difftest smoke test passed!
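Interim conclusion (RTL-style VBTB alignment experiment, gcc15 SPEC06 0.3c Int):
CI runs compared:
Conclusion: RTL-style VC16 is still slightly above the no-VC base (
Main Int regressions going from old VC to RTL-style:
So the interim assessment for the RTL colleagues is: the first common symptom of the reduced benefit is the drop in fetch utilization caused by split-FB; but in
If we continue aligning with RTL, I suggest first evaluating whether we can avoid "forcing an FB split on every VC hit", or split only when the VC result really needs to occupy the second half-block or a lost-bank conflict occurs. Another direction is to optimize the VBTB's set-associative placement/hash to reduce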
Change-Id: I68e230211b49c6f5015961605455e969cecfa424
🚀 Coremark Smoke Test Results
✅ Difftest smoke test passed!
What Changed
This PR bundles two related branch-prediction changes for Xiangshan `kmhv3`:
- `victimCacheSize=16` by default in `configs/example/kmhv3.py`.
- `DecoupledBPUWithBTB`.
The interflush approximation is intentionally minimal:
- `finalPred.btbEntries.size() > 8`, inject a fixed bubble penalty
- `entryLimit=8` and `penaltyCycles=2`
Why
`sjeng` contains dense control-flow windows where a 32B MBTB half-block can expose more than 4 branches, and with victim-cache recovery the total visible branch entries can exceed 8. RTL discussion indicated this is a realistic timing concern for the TAGE position-priority path.
This PR therefore does two things:
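The interflush approximation described under What Changed can be modeled in a few lines. The sketch below is Python rather than the C++ the predictor is written in. `InterflushModel`, `bubbles`, `total_bubbles`, and the sample trace are hypothetical names and data used only for illustration. Only the threshold of 8 visible entries and the 2-cycle penalty come from the PR description.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class InterflushModel:
    """Toy model of the fixed-penalty interflush approximation.

    Defaults mirror entryLimit=8 and penaltyCycles=2 from the PR
    description; the class itself is an illustrative assumption.
    """
    entry_limit: int = 8
    penalty_cycles: int = 2

    def bubbles(self, visible_entries: int) -> int:
        # A fixed bubble penalty is injected only when the number of
        # visible BTB entries strictly exceeds the limit.
        if visible_entries > self.entry_limit:
            return self.penalty_cycles
        return 0


def total_bubbles(model: InterflushModel, entries_per_block: Iterable[int]) -> int:
    """Sum the injected bubbles over a sequence of final predictions."""
    return sum(model.bubbles(n) for n in entries_per_block)


model = InterflushModel()
# Hypothetical per-prediction entry counts, not measured data.
trace = [3, 8, 9, 12, 5]
print(total_bubbles(model, trace))
```

Because the comparison is strict, a block with exactly 8 visible entries pays no penalty in this model, and the cost does not grow with how far a block exceeds the limit.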
Validation
Local validation completed:
- `./build/RISCV/cpu/pred/btb/test/interflush.test.opt`
- `scons build/RISCV/gem5.opt -j32`
- `sjeng_64284` checkpoint runs
Observed local results on `sjeng_64284`:
- `VC16` vs `VC=0`: IPC improved from `2.468417` to `2.546431` (`+3.16%`)
- `VC16 + interflush(2)` vs `VC16`: IPC changed from `2.546431` to `2.523163` (`-0.91%`)
- `VC16 + interflush(2)` vs `VC=0`: IPC still improved by about `+2.22%`
This suggests the 2-cycle interflush approximation reduces some VC benefit, but does not eliminate it on this slice.
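The percentages above follow directly from the three IPC values. A minimal sketch of the arithmetic, where `pct_change` and the dictionary keys are illustrative names rather than anything from the PR:

```python
def pct_change(new_ipc: float, old_ipc: float) -> float:
    """Relative IPC change of new_ipc over old_ipc, in percent."""
    return (new_ipc / old_ipc - 1.0) * 100.0


# IPC values quoted in the PR description for the sjeng_64284 checkpoint.
ipc = {
    "VC=0": 2.468417,
    "VC16": 2.546431,
    "VC16+interflush(2)": 2.523163,
}

vc_gain = pct_change(ipc["VC16"], ipc["VC=0"])
interflush_cost = pct_change(ipc["VC16+interflush(2)"], ipc["VC16"])
net_gain = pct_change(ipc["VC16+interflush(2)"], ipc["VC=0"])

# Fraction of the raw VC16 IPC gain that the 2-cycle penalty gives back.
lost_fraction = (ipc["VC16"] - ipc["VC16+interflush(2)"]) / (ipc["VC16"] - ipc["VC=0"])

print(f"VC16 vs VC=0:               {vc_gain:+.2f}%")
print(f"VC16+interflush(2) vs VC16: {interflush_cost:+.2f}%")
print(f"VC16+interflush(2) vs VC=0: {net_gain:+.2f}%")
print(f"share of VC gain lost:      {lost_fraction:.0%}")
```

The last figure is derived here and is not stated in the PR. It is one way to quantify "reduces some VC benefit" on this single slice, and it should not be generalized to other checkpoints.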
Notes