[codex] Enable MBTB VC16 and approximate dense-block interflush by jensen-yan · Pull Request #826 · OpenXiangShan/GEM5

jensen-yan · 2026-04-13T09:31:15Z

What Changed

This PR bundles two related branch-prediction changes for XiangShan kmhv3:

Enable MBTB victim cache with victimCacheSize=16 by default in configs/example/kmhv3.py.
Add a gated GEM5-side approximation for dense-BTB-block internal interflush in DecoupledBPUWithBTB.

The interflush approximation is intentionally minimal:

default off
when enabled, if finalPred.btbEntries.size() > 8, inject a fixed bubble penalty
current evaluation setting uses entryLimit=8 and penaltyCycles=2
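
The gating rule described above can be sketched in a few lines. This is a minimal illustration, not the code in DecoupledBPUWithBTB: the function name and the `enabled` flag are hypothetical, and only the `> 8` threshold, `entryLimit=8`, and `penaltyCycles=2` come from the description.

```python
def interflush_penalty(num_btb_entries: int,
                       enabled: bool = False,
                       entry_limit: int = 8,
                       penalty_cycles: int = 2) -> int:
    """Return the extra front-end bubble cycles for one prediction (toy model).

    `enabled` is a hypothetical switch standing in for the real parameter;
    it defaults to False because the PR states the feature is off by default.
    """
    if not enabled:
        return 0
    # Strictly greater than the limit: exactly 8 visible entries pay nothing.
    return penalty_cycles if num_btb_entries > entry_limit else 0


print(interflush_penalty(9, enabled=True))   # dense block, feature on
print(interflush_penalty(8, enabled=True))   # at the limit, no penalty
print(interflush_penalty(12))                # feature off by default
```

A fixed penalty per dense prediction keeps the model a pure throughput approximation, in line with the Notes section.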

Why

sjeng contains dense control-flow windows where a 32B MBTB half-block can expose more than 4 branches, and with victim-cache recovery the total visible branch entries can exceed 8. RTL discussion indicated this is a realistic timing concern for the TAGE position-priority path.

This PR therefore does two things:

keeps the MBTB victim-cache benefit available by default for kmhv3 experiments
provides a simple model hook to estimate how much of that benefit remains if dense-entry cases pay an internal interflush penalty

Validation

Local validation completed:

./build/RISCV/cpu/pred/btb/test/interflush.test.opt
scons build/RISCV/gem5.opt -j32
local sjeng_64284 checkpoint runs

Observed local results on sjeng_64284:

VC16 vs VC=0: IPC improved from 2.468417 to 2.546431 (+3.16%)
VC16 + interflush(2) vs VC16: IPC changed from 2.546431 to 2.523163 (-0.91%)
VC16 + interflush(2) vs VC=0: IPC still improved by about +2.22%

This suggests the 2-cycle interflush approximation reduces some VC benefit, but does not eliminate it on this slice.
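
For readers who want to re-derive the percentages, the arithmetic is just a relative change on the quoted IPC values. The helper name below is illustrative only.

```python
def rel_change_pct(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


# IPC values quoted in the results above (sjeng_64284 slice).
ipc_vc0 = 2.468417
ipc_vc16 = 2.546431
ipc_vc16_interflush2 = 2.523163

vc16_vs_vc0 = rel_change_pct(ipc_vc0, ipc_vc16)
interflush_vs_vc16 = rel_change_pct(ipc_vc16, ipc_vc16_interflush2)
interflush_vs_vc0 = rel_change_pct(ipc_vc0, ipc_vc16_interflush2)

print(f"{vc16_vs_vc0:+.2f}% {interflush_vs_vc16:+.2f}% {interflush_vs_vc0:+.2f}%")
```

Rounded to two decimals, the three values agree with the +3.16%, -0.91%, and +2.22% figures listed above.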

Notes

The interflush model is a throughput approximation only; it does not attempt to fully split FSQ/FTQ semantics.
The new interflush behavior is parameterized and disabled by default.

Change-Id: I1a44256d41beabc0e61e55240dcea9d6f5163eb8

Change-Id: I1c21db003e08b1cb10ed30498fff4490e3bd70bc

coderabbitai · 2026-04-13T09:31:23Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7f5e4ff4-3c4c-4e4d-8812-d9e664cfd9be


github-actions · 2026-04-13T09:46:55Z

🚀 Coremark Smoke Test Results

| Branch | IPC | Change |
| --- | --- | --- |
| Base (`xs-dev`) | `2.2605` | - |
| This PR | `2.2874` | 📈 `+0.0270` (`+1.19%`) |

✅ Difftest smoke test passed!

Change-Id: Iece153fabf84e766595317dd2038dfd59abc845d

github-actions · 2026-04-13T10:19:16Z

🚀 Coremark Smoke Test Results

| Branch | IPC | Change |
| --- | --- | --- |
| Base (`xs-dev`) | `2.2605` | - |
| This PR | `2.2799` | 📈 `+0.0194` (`+0.86%`) |

✅ Difftest smoke test passed!

XiangShanRobot · 2026-04-13T11:19:10Z

[Generated by GEM5 Performance Robot]
commit: cc93107
workflow: gem5 Align BTB Performance Test(0.3c)

Align BTB Performance

Overall Score

| Metric | PR | Master | Diff(%) |
| --- | --- | --- | --- |
| Score | 18.87 | 18.73 | +0.73 🟢 |

Overall Score

| Metric | PR | Previous Commit | Diff(%) |
| --- | --- | --- | --- |
| Score | 18.87 | 18.89 | -0.11 🔴 |

jensen-yan · 2026-04-14T03:35:48Z

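Following up on the performance comment on this PR, I compared the two successful perf runs locally with gem5_data_proc:

72dec64 (VC16 only): overall score 18.89
cc93107 (VC16 + interflush approximation): overall score 18.87
delta: -0.11%

The conclusion matches the local micro-run:

VC16 itself is effective; the PR still gains +0.73% over xs-dev.
Enabling interflush eats a small part of the gain, but does not wipe out the VC16 benefit.

Per benchmark, the regressions are concentrated in:

omnetpp: -1.16%
sjeng: -0.66%

There is also an offsetting item:

mcf: +0.74%

In the weighted counters, the regression pattern on sjeng matches expectations fairly well; it is mainly an increase in front-end bubbles:

overrideBubbleNum: +3.63%
fetch_bubble: +3.19%
cpi: +0.66%
fetch_nisn_mean: -0.87%

This looks more like the internal interflush penalty for dense BTB blocks taking effect than a large degradation in predictor correctness.

The omnetpp regression does not look like a pure bubble penalty:

cpi: +1.17%
but overrideBubbleNum: -1.98%
fetch_bubble: -0.91%
mbtb_update_miss: +29.6%

So omnetpp looks like a separate point worth drilling into on its own, and cannot necessarily be attributed directly to this interflush approximation.

Overall, this data supports a fairly solid judgment:

VC16 is worth keeping
interflush(2) causes a small regression, but its magnitude is manageable
the current benefit/cost relationship basically matches expectations on the RTL side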

jensen-yan · 2026-04-14T06:27:23Z

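A supplementary note on the motivation for this experiment and the current conclusions, mainly to make the RTL-side timing background clear.

1. The question this experiment answers

The core question on the RTL side right now is not "does the VC bring any benefit", but the following timing path:

MBTB and TAGE are queried in parallel at s1
MBTB delivers the branch entries / positions within the fetch block at s2
TAGE then does tag matching at s3 based on those branch positions and selects the first taken branch

When too many branch entries are visible in one fetch block, especially with the VC enabled, the number of branch positions to process in a single cycle may exceed what the RTL can currently close timing on easily.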
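
The s3 selection step can be pictured as a position-priority select over the visible branch slots. The sketch below is only a behavioural illustration under assumed names (`BranchSlot`, `first_taken`); it does not model the RTL or the gem5 predictor, and it says nothing about how the hardware implements the selection.

```python
from typing import NamedTuple, Optional, Sequence


class BranchSlot(NamedTuple):
    position: int          # byte offset of the branch inside the fetch block
    predicted_taken: bool  # direction after the (assumed) TAGE tag match


def first_taken(slots: Sequence[BranchSlot]) -> Optional[BranchSlot]:
    """Pick the taken branch with the smallest position, or None if none is taken."""
    taken = [s for s in slots if s.predicted_taken]
    return min(taken, key=lambda s: s.position) if taken else None


slots = [BranchSlot(18, True), BranchSlot(4, False), BranchSlot(10, True)]
print(first_taken(slots))  # the slot at position 10 wins over position 18
```

Every extra visible entry adds one more candidate to this select, which is one way to see why more than 8 entries in a block is a timing concern.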

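For sjeng this concern is well founded: its hot functions do contain very dense control flow within a 32B window.

So this PR is really answering two questions:

How large is the gain if MBTB VC16 is enabled by default?
If the RTL needs internal interflush / extra pipeline stages for the rare case of >8 entries, how much of the gain does that timing cost eat?

2. How this PR approximates the problem in gem5

This PR does not reproduce the RTL pipeline exactly in gem5; it provides two experimental handles:

VC16: evaluates whether the victim cache can recover the entries that do not fit in a 4-way main BTB half-block
interflush approximation: when finalPred.btbEntries.size() > 8, inject an additional fixed bubble penalty (2 cycles in the current experiments)

In other words, this approximation is not the final design. It is only meant to estimate:

if a timing special case for dense BTB blocks exists
how much of the VC benefit remains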
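
To make the interaction between the two handles concrete, here is a toy model of one 4-way half-block backed by a small victim cache. It is a sketch under strong assumptions: the class and method names are hypothetical, replacement is plain FIFO, and the victim cache is private to this one half-block, none of which is claimed to match the gem5 MBTB.

```python
from collections import OrderedDict


class ToyHalfBlock:
    """Toy model: one main-BTB set plus a small fully associative victim cache."""

    def __init__(self, num_ways: int = 4, victim_cache_size: int = 16):
        self.num_ways = num_ways
        self.victim_cache_size = victim_cache_size
        self.main = OrderedDict()    # branch pc -> target, oldest first
        self.victim = OrderedDict()  # entries evicted from main, oldest first

    def update(self, pc: int, target: int) -> None:
        if pc in self.main:
            self.main[pc] = target
            return
        self.victim.pop(pc, None)  # the entry moves back into the main set
        if len(self.main) >= self.num_ways:
            old_pc, old_target = self.main.popitem(last=False)
            if self.victim_cache_size > 0:
                self.victim[old_pc] = old_target
                if len(self.victim) > self.victim_cache_size:
                    self.victim.popitem(last=False)
        self.main[pc] = target

    def visible_entries(self) -> list:
        """Branch pcs a lookup would see: main ways plus victim-cache hits."""
        return sorted(set(self.main) | set(self.victim))


# Ten branches packed into one window (addresses are made up for illustration).
dense_pcs = [0x1000 + 2 * i for i in range(10)]

with_vc = ToyHalfBlock(num_ways=4, victim_cache_size=16)
without_vc = ToyHalfBlock(num_ways=4, victim_cache_size=0)
for pc in dense_pcs:
    with_vc.update(pc, pc + 0x40)
    without_vc.update(pc, pc + 0x40)

print(len(with_vc.visible_entries()), len(without_vc.visible_entries()))
```

In this toy, the victim cache makes all ten branches visible, so the `> 8` condition fires, whereas the 4-way set alone never exposes more than four entries.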

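3. Current CI results

The PR comments already contain the overall scores:

vs. xs-dev: 18.73 -> 18.87, +0.73%
vs. previous commit (VC16 only): 18.89 -> 18.87, -0.11%

This matches the local micro-run / gem5_data_proc analysis:

VC16 itself is effective
interflush(2) eats a small part of the gain
but it does not wipe out the VC16 gain

Per benchmark, the main regressions of VC16 + interflush relative to VC16 only are:

omnetpp: -1.16%
sjeng: -0.66%

There is also an offsetting item:

mcf: +0.74%

The weighted counters on sjeng also match expectations, mainly showing more front-end bubbles:

overrideBubbleNum: +3.63%
fetch_bubble: +3.19%
cpi: +0.66%
fetch_nisn_mean: -0.87%

This indicates that the regression on sjeng is essentially the direct cost of the dense-BTB-block penalty.

4. Additional local validation: has VC16 largely solved the shortage of ways?

I ran an additional local comparison on the same sjeng_64284 slice, with interflush turned off, looking only at the MBTB itself:

VC16: numWays=4, victimCacheSize=16
8way: numWays=8, victimCacheSize=0
8way_ns: numWays=8, numEntries=16384, victimCacheSize=0 (keeps numSets unchanged, closer to "a half-block can directly hold 8 entries")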
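
The "keeps numSets unchanged" remark can be checked with simple arithmetic, assuming the usual relation numSets = numEntries / numWays. That relation and the baseline entry count derived from it are assumptions here; the comment only states numEntries=16384 for the 8way_ns variant.

```python
def num_sets(num_entries: int, num_ways: int) -> int:
    """Sets in a set-associative table, assuming entries = sets * ways."""
    if num_entries % num_ways != 0:
        raise ValueError("num_entries must be a multiple of num_ways")
    return num_entries // num_ways


# Stated in the comment: 8way_ns uses numWays=8, numEntries=16384.
sets_8way_ns = num_sets(16384, 8)

# Inferred, not stated: if numSets is unchanged relative to the 4-way setup,
# the 4-way table would hold sets_8way_ns * 4 entries.
inferred_4way_entries = sets_8way_ns * 4

# Under the same inference, plain "8way" at that entry count halves the sets.
sets_8way_same_entries = num_sets(inferred_4way_entries, 8)

print(sets_8way_ns, inferred_4way_entries, sets_8way_same_entries)
```

If the inference holds, 8way trades sets for ways at constant capacity, while 8way_ns doubles capacity to isolate the effect of having 8 ways per half-block.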

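Results:

VC16: ipc = 2.546431
8way: ipc = 2.559537
8way_ns: ipc = 2.561925

In other words:

VC16 does not fully match the effect of "directly expanding half-block capacity to 8"
but it already captures most of the gain
the remaining gap is roughly 0.5% to 0.6% IPC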
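
The remaining gap can be reproduced from the three IPC values. The "captured share" at the end additionally reuses the VC=0 IPC (2.468417) from the PR description as a baseline; treating that run as directly comparable to these three is an assumption.

```python
ipc = {"VC16": 2.546431, "8way": 2.559537, "8way_ns": 2.561925}
ipc_no_vc = 2.468417  # VC=0 run from the PR description (assumed comparable)


def gap_pct(reference: float, other: float) -> float:
    """How far `other` is above `reference`, in percent of `reference`."""
    return (other - reference) / reference * 100.0


gap_8way = gap_pct(ipc["VC16"], ipc["8way"])
gap_8way_ns = gap_pct(ipc["VC16"], ipc["8way_ns"])

# Share of the 8way_ns gain over no-VC that VC16 already captures.
captured = (ipc["VC16"] - ipc_no_vc) / (ipc["8way_ns"] - ipc_no_vc)

print(f"gap to 8way: {gap_8way:.2f}%, gap to 8way_ns: {gap_8way_ns:.2f}%")
print(f"VC16 captures {captured:.0%} of the 8way_ns gain")
```

The two gaps land in the 0.5% to 0.6% range quoted above, and under the stated baseline assumption VC16 captures a bit over four fifths of the 8way_ns gain.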

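The hot branches support the same judgment:

0x133a8 (push_slidE): VC16=13881, 8way_ns=12110
0x1a542 (dense window in gen): VC16=666, 8way_ns=247

So the more accurate statement at this point is:

VC16 already significantly relieves the insufficient capacity of the 4-way main BTB half-block
but it is not yet fully equivalent to "a half-block directly supporting 8 entries"
even so, in terms of overall gain, VC16 already captures a large share of it

5. My current assessment of this direction

The current data supports a fairly solid conclusion:

VC16 is worth keeping
if the RTL needs internal interflush / an extra cycle for the >8 entries case, some of the gain is lost, but the magnitude is manageable
judging from sjeng, VC16 already captures most of the gain available in the dense-block problem
if we later want to chase the remaining ~0.5%, it will more likely take "actually increasing the number of entries a half-block can hold" rather than relying on the VC alone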

Change-Id: Ie7832f1a5caf99aaaed12b15c0764c8bbbaf2964

github-actions · 2026-05-15T10:59:30Z

🚀 Coremark Smoke Test Results

| Branch | IPC | Change |
| --- | --- | --- |
| Base (`xs-dev`) | `2.3130` | - |
| This PR | `2.3007` | 📉 `-0.0124` (`-0.53%`) |

✅ Difftest smoke test passed!

jensen-yan · 2026-05-18T07:14:09Z

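Interim conclusions (RTL-style VBTB alignment experiment, gcc15 SPEC06 0.3c Int):

CI runs compared:

base no-VC: 24987219976, Int 18.7218
old VC16: 24334267462, Int 18.8914
old VC16 + interflush approximation: 24337807496, Int 18.8715
RTL-style VC16: 25913540561, Int 18.7755

Conclusion: RTL-style VC16 is still slightly above the no-VC base (+0.287%), but it retains only about 32% of the old VC16's gain over base. This regression is not a single phenomenon; it falls mainly into two categories:

Splitting the FB / producing a half prediction block after a VC hit noticeably hurts fetch-side efficiency. The cleanest example is perlbench: going from old VC to RTL-style, score is -2.75% but br_mis_pred is -2.8%, so branch misprediction did not get worse; what got worse is fetch_bubble +17.6%, frontendBound +13.2%, fetch_nisn_mean -4.6%, overrideBubbleNum +36.9%. This looks more like prediction blocks being cut short, which degrades fetch utilization and front-end supply.
sjeng/gobmk/gcc are not a pure split problem; the set-associative constraint also reduces VC coverage. Looking directly at the VC counters, under RTL-style victimCacheSplit == victimCacheHit, i.e. every VC hit causes a split; at the same time there are fewer VC hits than with the old fully associative VC:
- sjeng: VC hit 143k -> 104k, updateInVC 31k -> 18k, predHitNum 13.52M -> 13.08M, allBranchMisses +5.1%, uncondMisses 7.5k -> 15.1k
- gobmk: VC hit 251k -> 129k, predMiss +6.3%, allBranchMisses +2.7%
- gcc: VC hit 168k -> 107k, predMiss +8.8%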
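
The +0.287% and "about 32%" figures in the conclusion follow directly from the four Int scores listed for the CI runs. A small re-derivation, with illustrative variable names:

```python
# Int scores from the CI runs listed above.
scores = {
    "base_no_vc": 18.7218,
    "old_vc16": 18.8914,
    "old_vc16_interflush": 18.8715,
    "rtl_style_vc16": 18.7755,
}

base = scores["base_no_vc"]
old_gain = scores["old_vc16"] - base          # absolute gain of the old VC16
rtl_gain = scores["rtl_style_vc16"] - base    # absolute gain of RTL-style VC16

rtl_vs_base_pct = rtl_gain / base * 100.0     # RTL-style relative to no-VC base
retained_share = rtl_gain / old_gain          # fraction of the old gain kept

print(f"RTL-style vs base: {rtl_vs_base_pct:+.3f}%")
print(f"share of old VC16 gain retained: {retained_share:.0%}")
```

The retained share is computed on absolute score differences, which is why it is much smaller than the ratio of the two scores themselves.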

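Main Int regressions from old VC to RTL-style:

perlbench: score -2.75%, fetch_bubble +17.6%, br_mis_pred -2.8%
gobmk: score -1.46%, fetch_bubble +5.0%, br_mis_pred +2.6%
sjeng: score -1.32%, fetch_bubble +3.9%, br_mis_pred +1.8%
xalancbmk: score -0.69%, fetch_bubble +11.9%, br_mis_pred +0.8%
gcc: score -0.55%, fetch_bubble +4.5%, br_mis_pred +1.5%

So the interim assessment for the RTL colleagues is: the first common symptom of the reduced gain is the drop in fetch utilization caused by split-FB; but on sjeng/gobmk/gcc the set-associative VC does also lose part of the "equivalent 16-way" coverage. The earlier local sjeng_64284 ablation is consistent with this: set-assoc alone loses about 0.52%, and enabling split on top of set-assoc loses about another 0.85%. Split is the larger effect, but set-assoc is not zero.

If we continue aligning with the RTL, I suggest evaluating first whether "force-splitting the FB on every VC hit" can be avoided, or splitting only when the VC result really needs to occupy the second half-block / a lost-bank conflict occurs; another direction is to optimize the set-associative placement/hash of the VBTB to reduce cases like sjeng/gobmk/gcc where VC hits and updateInVC drop noticeably.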
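
The first suggestion amounts to replacing an unconditional split with a conditional one. The predicate below is a hypothetical sketch of that policy choice: the enum, the function, and both condition flags are invented for illustration and do not correspond to existing gem5 or RTL signals.

```python
from enum import Enum


class SplitPolicy(Enum):
    ALWAYS_ON_VC_HIT = "always"   # behaviour observed in the RTL-style run
    ONLY_WHEN_NEEDED = "needed"   # proposed alternative to evaluate


def should_split(vc_hit: bool,
                 needs_second_half: bool,
                 lost_bank_conflict: bool,
                 policy: SplitPolicy) -> bool:
    """Decide whether a prediction splits the fetch block (toy predicate)."""
    if not vc_hit:
        return False
    if policy is SplitPolicy.ALWAYS_ON_VC_HIT:
        return True
    # Proposed: split only if the VC result occupies the second half-block
    # or collides with a lost bank.
    return needs_second_half or lost_bank_conflict


print(should_split(True, False, False, SplitPolicy.ALWAYS_ON_VC_HIT))
print(should_split(True, False, False, SplitPolicy.ONLY_WHEN_NEEDED))
```

Under the always policy, split count equals hit count, matching victimCacheSplit == victimCacheHit; how many splits the conditional policy would avoid depends on how often the two conditions hold, which the data above does not report.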

Change-Id: I68e230211b49c6f5015961605455e969cecfa424

github-actions · 2026-05-18T08:08:31Z

🚀 Coremark Smoke Test Results

| Branch | IPC | Change |
| --- | --- | --- |
| Base (`xs-dev`) | `2.3130` | - |
| This PR | `2.3294` | 📈 `+0.0164` (`+0.71%`) |

✅ Difftest smoke test passed!

jensen-yan added 2 commits April 13, 2026 16:47

configs: Enable MBTB victim cache in kmhv3

72dec64

Change-Id: I1a44256d41beabc0e61e55240dcea9d6f5163eb8

cpu: Approximate interflush on dense BTB blocks

cce46b9

Change-Id: I1c21db003e08b1cb10ed30498fff4490e3bd70bc

jensen-yan force-pushed the vc16-align branch from 2ecbe8a to cc93107 Compare April 13, 2026 10:09

configs: Enable interflush approximation in kmhv3

cc93107

Change-Id: Iece153fabf84e766595317dd2038dfd59abc845d

cpu: Model RTL-style MBTB victim cache constraints

7d7883b

Change-Id: Ie7832f1a5caf99aaaed12b15c0764c8bbbaf2964

configs: Disable MBTB victim cache split-on-hit

e983472

Change-Id: I68e230211b49c6f5015961605455e969cecfa424

jensen-yan wants to merge 5 commits into xs-dev from vc16-align
