[codex] Enable MBTB VC16 and approximate dense-block interflush#826

Draft
jensen-yan wants to merge 5 commits into xs-dev from vc16-align

Conversation

@jensen-yan
Collaborator

What Changed

This PR bundles two related branch-prediction changes for Xiangshan kmhv3:

  1. Enable MBTB victim cache with victimCacheSize=16 by default in configs/example/kmhv3.py.
  2. Add a gated GEM5-side approximation for dense-BTB-block internal interflush in DecoupledBPUWithBTB.

The interflush approximation is intentionally minimal:

  • default off
  • when enabled, if finalPred.btbEntries.size() > 8, inject a fixed bubble penalty
  • current evaluation setting uses entryLimit=8 and penaltyCycles=2
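
As a rough illustration of the gating described above, the following Python sketch models the decision in isolation. It is not the C++ code in DecoupledBPUWithBTB: the class and function names are hypothetical, and only the threshold (8 entries) and the penalty (2 cycles) are taken from the evaluation setting quoted above.

```python
from dataclasses import dataclass


@dataclass
class InterflushConfig:
    # Hypothetical container; field names are illustrative only.
    enabled: bool = False    # the PR states the feature is off by default
    entry_limit: int = 8     # evaluation setting quoted in the PR
    penalty_cycles: int = 2  # evaluation setting quoted in the PR


def interflush_bubbles(num_btb_entries: int, cfg: InterflushConfig) -> int:
    """Return how many bubble cycles to inject for one final prediction."""
    if cfg.enabled and num_btb_entries > cfg.entry_limit:
        return cfg.penalty_cycles
    return 0


cfg = InterflushConfig(enabled=True)
print(interflush_bubbles(8, cfg), interflush_bubbles(9, cfg))
```

The comparison is strict, so a prediction with exactly 8 visible entries pays no penalty in this sketch.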

Why

sjeng contains dense control-flow windows where a 32B MBTB half-block can expose more than 4 branches, and with victim-cache recovery the total visible branch entries can exceed 8. RTL discussion indicated this is a realistic timing concern for the TAGE position-priority path.

This PR therefore does two things:

  • keeps the MBTB victim-cache benefit available by default for kmhv3 experiments
  • provides a simple model hook to estimate how much of that benefit remains if dense-entry cases pay an internal interflush penalty
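
To make the capacity argument concrete, here is a toy Python model of one 4-way set backed by a small LRU victim cache. It is a generic victim-cache sketch, not the MBTB implementation: the class name, the per-set victim storage, and the LRU policy are all assumptions made for illustration.

```python
from collections import OrderedDict


class ToyBTBSet:
    """One set of a set-associative BTB plus a small LRU victim cache (toy model)."""

    def __init__(self, num_ways: int = 4, victim_cache_size: int = 16):
        self.num_ways = num_ways
        self.victim_cache_size = victim_cache_size
        self.main = OrderedDict()     # branch pc -> target, oldest first
        self.victims = OrderedDict()  # entries evicted from the main set

    def insert(self, pc: int, target: int) -> None:
        self.victims.pop(pc, None)    # an entry lives in one place only
        self.main[pc] = target
        self.main.move_to_end(pc)
        if len(self.main) > self.num_ways:
            old_pc, old_target = self.main.popitem(last=False)
            if self.victim_cache_size > 0:
                self.victims[old_pc] = old_target
                if len(self.victims) > self.victim_cache_size:
                    self.victims.popitem(last=False)

    def visible_entries(self) -> list:
        """Branch PCs a lookup can see: main ways plus victim-cache hits."""
        return list(self.main) + list(self.victims)


dense = ToyBTBSet(num_ways=4, victim_cache_size=16)
for i in range(6):                    # six branches mapping to one set
    dense.insert(0x1000 + 2 * i, 0x2000)
print(len(dense.visible_entries()))
```

With the victim cache disabled (`victim_cache_size=0`) the same sequence leaves only four entries visible, which is the capacity limit the PR description refers to.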

Validation

Local validation completed:

  • ./build/RISCV/cpu/pred/btb/test/interflush.test.opt
  • scons build/RISCV/gem5.opt -j32
  • local sjeng_64284 checkpoint runs

Observed local results on sjeng_64284:

  • VC16 vs VC=0: IPC improved from 2.468417 to 2.546431 (+3.16%)
  • VC16 + interflush(2) vs VC16: IPC changed from 2.546431 to 2.523163 (-0.91%)
  • VC16 + interflush(2) vs VC=0: IPC still improved by about +2.22%

This suggests the 2-cycle interflush approximation reduces some VC benefit, but does not eliminate it on this slice.
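
The percentages above can be reproduced from the three IPC values. The short Python check below is only a convenience for readers; the dictionary keys are made-up labels, and the "retained" ratio is derived here rather than quoted from the PR.

```python
def pct_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


# IPC values reported for the sjeng_64284 slice.
ipc = {"vc0": 2.468417, "vc16": 2.546431, "vc16_interflush2": 2.523163}

vc16_gain = pct_change(ipc["vc0"], ipc["vc16"])
interflush_cost = pct_change(ipc["vc16"], ipc["vc16_interflush2"])
net_gain = pct_change(ipc["vc0"], ipc["vc16_interflush2"])

# Share of the VC16 IPC gain that survives the 2-cycle penalty.
retained = (ipc["vc16_interflush2"] - ipc["vc0"]) / (ipc["vc16"] - ipc["vc0"])

print(f"{vc16_gain:+.2f}% {interflush_cost:+.2f}% {net_gain:+.2f}% {retained:.0%}")
```

On these numbers roughly 70% of the VC16 gain remains after the penalty.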

Notes

  • The interflush model is a throughput approximation only; it does not attempt to fully split FSQ/FTQ semantics.
  • The new interflush behavior is parameterized and disabled by default.

Change-Id: I1a44256d41beabc0e61e55240dcea9d6f5163eb8
Change-Id: I1c21db003e08b1cb10ed30498fff4490e3bd70bc
@coderabbitai

coderabbitai Bot commented Apr 13, 2026

Important

Review skipped

Draft detected.

@github-actions

🚀 Coremark Smoke Test Results

| Branch | IPC | Change |
|---|---|---|
| Base (xs-dev) | 2.2605 | - |
| This PR | 2.2874 | 📈 +0.0270 (+1.19%) |

✅ Difftest smoke test passed!

Change-Id: Iece153fabf84e766595317dd2038dfd59abc845d
@github-actions

🚀 Coremark Smoke Test Results

| Branch | IPC | Change |
|---|---|---|
| Base (xs-dev) | 2.2605 | - |
| This PR | 2.2799 | 📈 +0.0194 (+0.86%) |

✅ Difftest smoke test passed!

@XiangShanRobot

[Generated by GEM5 Performance Robot]
commit: cc93107
workflow: gem5 Align BTB Performance Test(0.3c)

Align BTB Performance

Overall Score

| | PR | Master | Diff(%) |
|---|---|---|---|
| Score | 18.87 | 18.73 | +0.73 🟢 |

[Generated by GEM5 Performance Robot]
commit: cc93107
workflow: gem5 Align BTB Performance Test(0.3c)

Align BTB Performance

Overall Score

| | PR | Previous Commit | Diff(%) |
|---|---|---|---|
| Score | 18.87 | 18.89 | -0.11 🔴 |

@jensen-yan
Collaborator Author

Following up on the performance comments on this PR, I compared the two successful perf runs locally with gem5_data_proc:

  • 72dec64 (VC16 only): overall score 18.89
  • cc93107 (VC16 + interflush approximation): overall score 18.87
  • delta: -0.11%

The conclusion matches the local micro-run:

  • VC16 itself is effective: the PR still gains +0.73% over xs-dev.
  • Interflush eats a small part of that gain, but does not wipe out the VC16 benefit.

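By benchmark, the regressions are concentrated in:

  • omnetpp: -1.16%
  • sjeng: -0.66%

There is also an offsetting gain:

  • mcf: +0.74%

In the weighted counters, the sjeng regression matches expectations and comes mainly from additional front-end bubbles:

  • overrideBubbleNum: +3.63%
  • fetch_bubble: +3.19%
  • cpi: +0.66%
  • fetch_nisn_mean: -0.87%

This looks like the internal interflush penalty for dense BTB blocks taking effect, rather than a large loss in predictor correctness.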

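The omnetpp regression does not look like a pure bubble penalty:

  • cpi: +1.17%
  • overrideBubbleNum: -1.98%
  • fetch_bubble: -0.91%
  • mbtb_update_miss: +29.6%

So omnetpp looks like a separate case worth drilling into on its own; it cannot necessarily be attributed directly to the interflush approximation.

Overall, this data supports a fairly solid conclusion:

  • VC16 is worth keeping
  • interflush(2) causes a small regression, but of a manageable size
  • the current benefit/cost trade-off is broadly in line with RTL-side expectations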

@jensen-yan
Collaborator Author

jensen-yan commented Apr 14, 2026

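Some additional notes on the motivation for this experiment and the current conclusions, mainly to spell out the RTL-side timing background.

1. The question this experiment answers

The core RTL-side issue right now is not whether the VC helps, but the following timing path:

  • MBTB and TAGE are both looked up in s1
  • MBTB delivers the branch entries / positions within the fetch block in s2
  • TAGE then does the tag match against those branch positions in s3 and selects the first taken branch

When too many branch entries are visible within one fetch block, especially with the VC enabled, the number of branch positions to process in a single cycle may exceed what the RTL can currently close timing on comfortably.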
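
The s3 selection can be pictured as a priority pick over every visible branch position, which is why the entry count matters for timing. The Python sketch below shows only that selection step under simplifying assumptions; `BranchEntry` and `first_taken` are hypothetical names and do not correspond to RTL or gem5 identifiers.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class BranchEntry:
    position: int  # hypothetical: offset of the branch within the fetch block
    taken: bool    # hypothetical: direction predicted for this entry


def first_taken(entries: Sequence[BranchEntry]) -> Optional[BranchEntry]:
    """Pick the taken entry with the lowest position, or None if none is taken."""
    taken = [e for e in entries if e.taken]
    return min(taken, key=lambda e: e.position) if taken else None


entries = [BranchEntry(12, True), BranchEntry(4, False), BranchEntry(8, True)]
print(first_taken(entries))
```

In hardware this is a priority selection whose width grows with the number of candidate entries, so a cap such as 8 entries bounds the work done in a single cycle.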

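For sjeng this concern is well founded: its hot functions do contain very dense control flow within a 32B window.

So this PR is really answering two questions:

  1. How much is gained if MBTB VC16 is enabled by default?
  2. If the RTL has to handle the rare >8-entry case with an internal interflush / multi-cycle processing, how much of that gain does the timing cost eat?

2. How this PR approximates the problem in gem5

This PR does not reproduce the RTL pipeline exactly in gem5; it provides two experimental knobs:

  • VC16: evaluates whether the victim cache can recover entries that do not fit in a 4-way half-block of the main BTB
  • interflush approximation: when finalPred.btbEntries.size() > 8, inject an additional fixed bubble penalty (2 cycles in the current experiments)

In other words, the approximation is not the final design. It is only meant to estimate:

  • if a timing special case for dense BTB blocks exists
  • how much of the VC benefit then remains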

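3. Current CI results

The PR comments already report the overall scores:

  • vs. xs-dev: 18.73 -> 18.87 (+0.73%)
  • vs. the previous commit (VC16 only): 18.89 -> 18.87 (-0.11%)

This is consistent with the local micro-run and the gem5_data_proc analysis:

  • VC16 itself is effective
  • interflush(2) eats a small part of the gain
  • but it does not wipe out the VC16 benefit

By benchmark, the main regressions of VC16 + interflush relative to VC16 only are:

  • omnetpp: -1.16%
  • sjeng: -0.66%

There is also an offsetting gain:

  • mcf: +0.74%

The weighted counters on sjeng also match expectations, showing mainly an increase in front-end bubbles:

  • overrideBubbleNum: +3.63%
  • fetch_bubble: +3.19%
  • cpi: +0.66%
  • fetch_nisn_mean: -0.87%

This indicates that the sjeng regression is essentially the direct cost of the dense-BTB-block penalty.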

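4. Additional local validation: has VC16 essentially solved the shortage of ways?

I ran a further local comparison on the same sjeng_64284 slice with interflush turned off, looking only at the MBTB itself:

  • VC16: numWays=4, victimCacheSize=16
  • 8way: numWays=8, victimCacheSize=0
  • 8way_ns: numWays=8, numEntries=16384, victimCacheSize=0 (keeps numSets unchanged; closer to "a half-block can directly hold 8 entries")

Results:

  • VC16: ipc = 2.546431
  • 8way: ipc = 2.559537
  • 8way_ns: ipc = 2.561925

That is:

  • VC16 does not fully match the effect of directly widening the half-block capacity to 8
  • but it already captures most of the benefit
  • the remaining gap is roughly 0.5% to 0.6% IPC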
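
The size of the remaining gap follows directly from the three IPC values. The Python check below recomputes it; the set-count line assumes the usual set-associative relation numSets = numEntries / numWays, which the comment implies but does not state explicitly.

```python
# IPC values reported for the three MBTB configurations (interflush off).
ipc = {"VC16": 2.546431, "8way": 2.559537, "8way_ns": 2.561925}


def gap_pct(reference: float, candidate: float) -> float:
    """How far candidate is ahead of reference, in percent."""
    return (candidate - reference) / reference * 100.0


gap_8way = gap_pct(ipc["VC16"], ipc["8way"])
gap_8way_ns = gap_pct(ipc["VC16"], ipc["8way_ns"])

# Assumption: numSets = numEntries / numWays for the 8way_ns configuration.
num_sets_8way_ns = 16384 // 8

print(f"{gap_8way:.2f}% {gap_8way_ns:.2f}% numSets={num_sets_8way_ns}")
```

Under that assumption, 8way_ns has 2048 sets with 8 ways each.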

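The hot branches support the same conclusion:

  • 0x133a8 (push_slidE): VC16=13881, 8way_ns=12110
  • 0x1a542 (dense window in gen): VC16=666, 8way_ns=247

So the more accurate statement at this point is:

  • VC16 significantly relieves the capacity shortage of the 4-way half-block in the main BTB
  • but it is not fully equivalent to a half-block that directly supports 8 entries
  • even so, in terms of overall benefit, VC16 already captures a large share of that gain

5. My current assessment of this direction

The current data supports a fairly solid conclusion:

  • VC16 is worth keeping
  • if the RTL needs an internal interflush / one extra cycle for the >8-entry case, some of the gain is lost, but the amount is manageable
  • on sjeng, VC16 already captures most of the benefit available from the dense-block problem
  • chasing the remaining ~0.5% would more likely require actually increasing the number of entries a half-block can hold, rather than relying on the VC alone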

Change-Id: Ie7832f1a5caf99aaaed12b15c0764c8bbbaf2964
@github-actions

🚀 Coremark Smoke Test Results

| Branch | IPC | Change |
|---|---|---|
| Base (xs-dev) | 2.3130 | - |
| This PR | 2.3007 | 📉 -0.0124 (-0.53%) |

✅ Difftest smoke test passed!

@jensen-yan
Collaborator Author

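Interim conclusion (RTL-style VBTB alignment experiment, gcc15 SPEC06 0.3c Int):

CI runs compared:

  • base no-VC: 24987219976, Int 18.7218
  • old VC16: 24334267462, Int 18.8914
  • old VC16 + interflush approximation: 24337807496, Int 18.8715
  • RTL-style VC16: 25913540561, Int 18.7755

Conclusion: RTL-style VC16 is still slightly above the no-VC base (+0.287%), but it retains only about 32% of the old VC16 gain over base. The regression is not a single effect; it falls into two main categories:

  1. Splitting the FB (half prediction blocks) after a VC hit clearly hurts fetch-side efficiency. The cleanest example is perlbench: going from old VC to RTL-style, the score changes by -2.75% while br_mis_pred changes by -2.8%, so branch misprediction did not get worse. What got worse is fetch_bubble +17.6%, frontendBound +13.2%, fetch_nisn_mean -4.6%, and overrideBubbleNum +36.9%. This looks like prediction blocks being cut short, which degrades fetch utilization and front-end supply.

  2. sjeng/gobmk/gcc are not purely a split problem: the set-associative constraint also reduces VC coverage. The VC counters show directly that in RTL-style victimCacheSplit == victimCacheHit, i.e. every VC hit causes a split; at the same time there are fewer VC hits than with the old fully associative VC:

    • sjeng: VC hit 143k -> 104k, updateInVC 31k -> 18k, predHitNum 13.52M -> 13.08M, allBranchMisses +5.1%, uncondMisses 7.5k -> 15.1k
    • gobmk: VC hit 251k -> 129k, predMiss +6.3%, allBranchMisses +2.7%
    • gcc: VC hit 168k -> 107k, predMiss +8.8%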
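
The "+0.287%" and "about 32%" figures can be recomputed from the Int scores of the CI runs listed above. The Python snippet below is a small arithmetic check; the dictionary keys are labels chosen here for readability.

```python
# SPEC06 Int scores from the compared CI runs.
score = {
    "base_no_vc": 18.7218,
    "old_vc16": 18.8914,
    "old_vc16_interflush": 18.8715,
    "rtl_style_vc16": 18.7755,
}

base = score["base_no_vc"]

# Gain of each VC variant over the no-VC base, in percent.
rtl_gain_pct = (score["rtl_style_vc16"] - base) / base * 100.0
old_gain_pct = (score["old_vc16"] - base) / base * 100.0

# Fraction of the old VC16 gain that the RTL-style VC16 still delivers.
retained = (score["rtl_style_vc16"] - base) / (score["old_vc16"] - base)

print(f"{rtl_gain_pct:+.3f}% {old_gain_pct:+.3f}% retained={retained:.1%}")
```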

Main Int regressions from old VC -> RTL-style:

  • perlbench: score -2.75%, fetch_bubble +17.6%, br_mis_pred -2.8%
  • gobmk: score -1.46%, fetch_bubble +5.0%, br_mis_pred +2.6%
  • sjeng: score -1.32%, fetch_bubble +3.9%, br_mis_pred +1.8%
  • xalancbmk: score -0.69%, fetch_bubble +11.9%, br_mis_pred +0.8%
  • gcc: score -0.55%, fetch_bubble +4.5%, br_mis_pred +1.5%

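So the interim assessment for the RTL team is: the first common symptom of the reduced gain is the drop in fetch utilization caused by split-FB; but on sjeng/gobmk/gcc, the set-associative VC does also lose part of the "effective 16-way" coverage. The earlier local sjeng_64284 ablation is consistent with this: set-assoc alone costs about 0.52%, and enabling split on top of set-assoc costs an additional ~0.85%. Split is the larger effect, but set-assoc is not zero.

If we continue aligning with RTL, I suggest evaluating first whether we can avoid forcing an FB split on every VC hit, or split only when the VC result actually needs to occupy the second half-block or a lost-bank conflict occurs. Another direction is to optimize the set-associative placement/hash of the VBTB, to reduce cases like sjeng/gobmk/gcc where VC hits and updateInVC drop noticeably.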

Change-Id: I68e230211b49c6f5015961605455e969cecfa424
@github-actions

🚀 Coremark Smoke Test Results

| Branch | IPC | Change |
|---|---|---|
| Base (xs-dev) | 2.3130 | - |
| This PR | 2.3294 | 📈 +0.0164 (+0.71%) |

✅ Difftest smoke test passed!
