Skip to content

Kmh smt rebase#863

Open
tastynoob wants to merge 37 commits into
xs-devfrom
kmh-smt-rebase
Open

Kmh smt rebase#863
tastynoob wants to merge 37 commits into
xs-devfrom
kmh-smt-rebase

Conversation

@tastynoob
Copy link
Copy Markdown
Collaborator

@tastynoob tastynoob commented May 25, 2026

Summary by CodeRabbit

  • New Features

    • SMT support: shared/independent queue modes and dynamic resource borrowing.
    • New HartCtrl MMIO device for per-hart reset/control in bare-metal RISC‑V.
    • Thread-aware branch predictor and FTQ/BTB enhancements for ASID/thread isolation.
    • Multi-context difftest/golden-memory support.
  • Improvements

    • Per-thread LSQ/store-buffer, commit, rename, and fetch/dispatch refinements for SMT.
    • Better privilege handling and more robust store visibility/reconciliation for multi-threaded runs.

Review Change Stack

mhnGitHubz and others added 30 commits May 22, 2026 17:45
Co-authored-by: mohaonan <mo.haonan1@sanechips.com.cn>
Change-Id: I7690e69545b01ca4a8ba3e751f6cab7665f8767e
Change-Id: Ifbeb947f5f27ebdc9dc39dcfe0172eaa308f8e6f
Change-Id: I146d1ac20d06015e98713f30bae71fef3f5d7bcf
Change-Id: Icc05a7320ee5bf1495ef98694c3e92847613d79e
Co-authored-by: mo haonan <mohaonan@node024.bosccluster.com>
Change-Id: Ic416b537a4f2c87059c92c7b5be81618b1898e22
Change-Id: Ieb232b296c7c99ea216c14c23f135e6e081870a6
Change-Id: I11b460b6a6554998d052a020a02d84eb2b0664ad
Change-Id: Ice8e66d841a40e8c8420bd4756237eb9399d1642
Use per-thread committed instruction counts for O3 warmup and stat-dump stop checks in SMT mode, instead of summing instructions across threads.

Change-Id: I6ecd5f96a18ce9aa96d0712a9e05f3d8dedcbac4
Co-authored-by: mo haonan <mohaonan@node023.bosccluster.com>
Change-Id: Idea57378fbe47dedd654f168141d0284faee716f
Change-Id: Ic49b3e7ab82a32b81427026ce8d185e7ebeaba76
Change-Id: I4ee24d52f8ffe8fdc0a2086d97bcbc9be10860cb
Co-authored-by: mo haonan <mohaonan@node023.bosccluster.com>
Change-Id: I653f16b0adefdfcc978f54791e21adbf74ecd84e
Change-Id: I8060ec785799940c4190fdba32b349226f44795d
Change-Id: Ia8509f4290de8b2c7e337892506c69a8d399d493
…buffer resource allocation mechanism; 3. Vectorize waitForVsetvl (#844)

Co-authored-by: mo haonan <mohaonan@node023.bosccluster.com>
Change-Id: I30d0a510565c278349b4d4b915e84ead21f8c6ae
Change-Id: Id6528ad1ec2b2ad7a26cd0ef18030b8f671c3302
Change-Id: Ica716ed754083ab470c1eace992837b454547b55
Carry the triggering request context through prefetch metadata and final HardPFReq creation. Use context-aware prefetch queue matching so SMT threads with the same virtual address do not squash or deduplicate each other's prefetches.

Validation:

- scons build/RISCV/gem5.opt -j16

- git diff --check

Change-Id: I6a40826b47bff2e7ee8c7748eecd8622b44ca3c3
Use per-instruction dispatch age for IQ ordering and remove the old SMT-specific selector plumbing.

Change-Id: I06f4cec1fcbe910652272cf1caf9366bdbbea171
…est priority. (#847)

Co-authored-by: mo haonan <mohaonan@node023.bosccluster.com>
Change-Id: I1078cf7423248c619faeb4fde7f0a210a7d02b77
tastynoob and others added 6 commits May 25, 2026 11:20
Change-Id: I735b9557f8f8d69f094121bb6229bd3ab7682f49
Change-Id: Ie6f7cfc4bd2d2d8f7aad6c9e7bace19534b77c00
Change-Id: Iccef736835125a5ef35efb06c3eeced41f189b6c
…adlock (#858)

Co-authored-by: mo haonan <mohaonan@node023.bosccluster.com>
Change-Id: I615bad73ac28ceea5e51bd1f594f94acda1f7705
Change-Id: I50ab39ce30eebfb6d129c2c0fafd8f855b536730
@tastynoob tastynoob requested a review from jensen-yan May 25, 2026 07:23
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 25, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8b7c1646-6769-4145-82b0-cfdaa52c8fbe

📥 Commits

Reviewing files that changed from the base of the PR and between 8d75755 and 446710f.

📒 Files selected for processing (1)
  • src/cpu/pred/btb/test/btb_tage.test.cc

📝 Walkthrough

Walkthrough

Adds FS-SMT support across configs, per-thread difftest and golden-memory, per-thread O3 pipeline/front-end and backend changes, shared LSQ/storebuffer with generation/visibility, ASID-aware branch predictors, HartCtrl MMIO device, vector-load agnostic fills, and prefetch context propagation.

Changes

FS-SMT Enablement and Per-Thread Refactors

Layer / File(s) Summary
Platform wiring and device configs/common/*, src/dev/riscv/*, configs/example/smt_idealkmhv3.py — pass num_threads, add HartCtrl SimObject and C++ device, update .gitignore.
Difftest multi-context src/cpu/base.*, src/sim/system.*, src/cpu/difftest.*, src/cpu/simple/* — per-thread difftest state, golden memory per-thread, difftestHartId, per-thread register reads.
O3 frontend SMT src/cpu/o3/fetch*, decode*, smt_sched.hh — per-thread fetch/decode scheduling, FTQ sharing, SMT decode scheduler, retry queue handling.
O3 backend SMT and arbitration src/cpu/o3/commit*, iew*, rob*, issue_queue*, inst_queue*, comm.hh, rename*, dyn_inst.hh — per-thread commit/commit-stats, dynamic ROB borrowing, SmtActiveThreadArbiter, per-thread squash/versioning, dispatch age tracking.
Shared LSQ/store buffer src/cpu/o3/lsq*, lsq_unit* — shared/partitioned LSQ modes, store-buffer per-byte generation/versioning, cross-thread visibility/invalidation, sequence-bounded flush, forwarding APIs.
Branch predictors (ASID/TID) src/cpu/pred/btb/*, BranchPredictor.py — per-thread/ASID metadata, getPredictionMeta(tid), asidHash plumbing across ABTB/ITTAGE/TAGE/MGSC/MicroTAGE/uBTB/RAS, FTQ params.
Prefetch context tracking src/mem/cache/prefetch/* — add ContextID to PrefetchInfo, propagate through queued/worker/request paths.
RISC-V TLB and vector ISA src/arch/riscv/tlb.*, src/arch/riscv/isa/vector/* — per-thread old-priv bookkeeping and new agnostic vector-load fill macros.
Misc and tests .gitignore, src/mem/physical.cc, predictor tests and example configs updated for SMT/TID/ASID.

Sequence Diagram(s)

sequenceDiagram
  participant Fetch
  participant Predictor
  participant FTQ
  participant Decode
  participant LSQ
  Fetch->>Predictor: putPCHistory(tid, asidHash)
  Predictor-->>Fetch: stagePreds[tid] with metas
  Fetch->>FTQ: enqueue targets (shared/partitioned FTQ)
  FTQ-->>Decode: dequeue per-thread entries
  Decode->>LSQ: load/store requests (per-thread)
  LSQ-->>Decode: backpressure / store-forwarding info
Loading

Estimated code review effort
🎯 5 (Critical) | ⏱️ ~180 minutes

Possibly related PRs

Suggested labels
perf, align-kmhv3

Suggested reviewers

  • jensen-yan

Poem

"I’m a rabbit in the circuit maze,
Hopping threads in SMT's bright haze.
FTQs whisper, LSQs hum,
HartCtrl nudges sleepy harts—come!
Carrot-code and nimble paws, hooray, the cores all play!"

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch kmh-smt-rebase

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/cpu/base.cc (1)

743-749: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Preserve the new per-thread difftest replay state on CPU handover.

takeOverFrom() only transfers diffAllStates, but the new reconciliation state lives in recentCommittedStores and syncVisibleStoreReplayArmed. After a CPU switch, the replacement core loses the committed-store history and replay arm bits that the new load/store mismatch recovery depends on, so difftest can diverge immediately after handover.

Suggested fix
     if (enable_diff) {
         warn("Take over difftest state to new CPU\n");
         enableDifftest = enable_diff;
         takeOverDiffAllStates(std::move(diff_all));
+        recentCommittedStores = std::move(oldCPU->recentCommittedStores);
+        syncVisibleStoreReplayArmed =
+            std::move(oldCPU->syncVisibleStoreReplayArmed);
     }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cpu/base.cc` around lines 743 - 749, The CPU handover currently only
transfers diffAllStates via takeOverDiffAllStates, but it must also preserve
per-thread reconciliation state: copy recentCommittedStores and the
syncVisibleStoreReplayArmed flag from oldCPU to this CPU when enable_diff is
true; update the same handover site where getDiffAllStates() and
takeOverDiffAllStates(std::move(diff_all)) are called to call the appropriate
accessors/mutators (e.g., recentCommittedStores and syncVisibleStoreReplayArmed
on the source CPU and the destination CPU) so the replacement core retains
committed-store history and replay-arm bits after
takeOverFrom/takeOverDiffAllStates.
src/cpu/pred/btb/ras.cc (1)

20-27: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Avoid undefined behavior in maxCtr initialization (BTBRAS)

src/cpu/pred/btb/ras.cc sets maxCtr((1 << ctrWidth) - 1) in both constructors; (1 << ctrWidth) is done as int (since 1 is an int literal) and maxCtr is also int (src/cpu/pred/btb/ras.hh). ctrWidth is a Param.Unsigned that is only range-checked to 32-bit unsigned (0..2^32-1), so a config with ctrWidth >= 31 can overflow/trigger undefined shift behavior. Use a wide unsigned literal/cast and guard the shift width (or widen/change maxCtr’s type).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cpu/pred/btb/ras.cc` around lines 20 - 27, The initialization of maxCtr
in the BTBRAS constructors uses (1 << ctrWidth) which shifts a 32-bit int and
can UB when ctrWidth >= 31; update the BTBRAS constructors to compute maxCtr
using a widened unsigned type and guard the shift: e.g., perform the shift with
1ULL (or static_cast<uint64_t>(1)) and cap/validate ctrWidth against the
bit-width before shifting, or change maxCtr's type to an unsigned wide type
(e.g., uint64_t) in ras.hh and compute maxCtr as ((ctrWidth >=
(sizeof(maxCtr)*8)) ? std::numeric_limits<decltype(maxCtr)>::max() :
((decltype(maxCtr)1 << ctrWidth) - 1)); adjust both BTBRAS(...) initializers
accordingly.
🟠 Major comments (23)
src/cpu/base.cc-911-915 (1)

911-915: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use the resolved CPU id when deriving difftest hart ids.

Line 914 uses params().cpu_id, but this constructor can auto-fix invalid IDs by writing _cpuId earlier in the file. If Python leaves cpu_id unset, difftest still gets the stale sentinel here and assigns the wrong hart id.

Suggested fix
 int
 BaseCPU::difftestHartId(ThreadID tid) const
 {
-    return params().cpu_id * numThreads + tid;
+    return _cpuId * numThreads + tid;
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cpu/base.cc` around lines 911 - 915, The difftest hart id calculation in
BaseCPU::difftestHartId uses params().cpu_id which may be the unresolved
sentinel; change it to use the resolved member CPU id (the field set earlier,
e.g., _cpuId or whatever member holds the fixed CPU id) so the function returns
(_cpuId * numThreads + tid) instead of (params().cpu_id * numThreads + tid);
update the reference in difftestHartId to the resolved ID member to ensure
correct hart assignment.
src/cpu/o3/commit.cc-1231-1237 (1)

1231-1237: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

This loop bypasses smtCommitPolicy.

The new path iterates *activeThreads directly and never consults getCommittingThread(), roundRobin(), or oldestReady(). After this change, non-default smtCommitPolicy settings stop affecting commit scheduling.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cpu/o3/commit.cc` around lines 1231 - 1237, The loop that iterates
*activeThreads directly bypasses the SMT commit policy; replace the direct
iteration with the policy-driven selection used elsewhere by calling the
existing getCommittingThread / smtCommitPolicy selection path (e.g., use
getCommittingThread(), roundRobin(), or oldestReady() as the codebase expects)
so commit scheduling honors smtCommitPolicy. Specifically, stop looping over
activeThreads in commit.cc and instead obtain each next commit_thread via the
smtCommitPolicy/getCommittingThread mechanism and then check
commitStatus[commit_thread] before committing, preserving the original
policy-driven ordering and termination condition.
src/cpu/o3/decode.cc-90-97 (1)

90-97: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Account for the new stall buffer in reset and drain paths.

stallBuffer and eachstallSize become persistent decode state here, but clearStates(), isDrained(), and drainSanityCheck() still only reason about fixedbuffer. After a thread removal or CPU drain, stale fetch groups can survive in decode and be replayed later.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cpu/o3/decode.cc` around lines 90 - 97, clearStates(), isDrained(), and
drainSanityCheck() currently only handle fixedbuffer and ignore the new decode
state stallBuffer and eachstallSize; update those methods to account for them by
clearing stallBuffer and eachstallSize in clearStates(), including them in the
emptiness/drain checks in isDrained(), and verifying they are empty in
drainSanityCheck() (similar to how fixedbuffer is handled) so stale fetch groups
cannot survive thread removal or CPU drain.
src/cpu/o3/commit.cc-1227-1229 (1)

1227-1229: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't mark the CPU active on zero-work commit cycles.

commit_width >= 0 is always true, so commit now calls activityThisCycle() even when no thread can retire anything. That can keep an empty core from going idle.

Suggested fix
-    if (commit_width >= 0) {
+    if (commit_width > 0) {
         cpu->activityThisCycle();
     }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cpu/o3/commit.cc` around lines 1227 - 1229, The code calls
cpu->activityThisCycle() whenever commit_width >= 0 which is always true; change
the condition inside the commit routine so activityThisCycle() is only invoked
when there was actual work retired (e.g., use commit_width > 0 or, better, check
the actual number of instructions retired this cycle such as a local
num_committed/count_retired variable) and call cpu->activityThisCycle() only
when that count > 0; update the condition around cpu->activityThisCycle() in
commit (the block referencing commit_width and cpu->activityThisCycle())
accordingly.
src/cpu/o3/commit.cc-1462-1485 (1)

1462-1485: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use the 32-bit VSSTATUS view consistently in the vsxl == 1 path.

Line 1475 mixes vsstatus32.fs with vsstatus.vs. That recomputes sd from two different register views and can write the wrong architectural VSSTATUS.sd bit.

Suggested fix
-                                vsstatus32.sd = (vsstatus32.fs == 3) || (vsstatus.vs == 3);
+                                vsstatus32.sd = (vsstatus32.fs == 3) || (vsstatus32.vs == 3);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cpu/o3/commit.cc` around lines 1462 - 1485, In the
head_inst->isUpdateVsstatusSd() block, the vsxl == 1 path mixes the 32-bit view
and 64-bit view (uses vsstatus32.fs but vsstatus.vs) causing incorrect SD
computation; change the computation to use only the 32-bit view (vsstatus32.fs
and vsstatus32.vs) and ensure you write back (RegVal)vsstatus32 via
cpu->setMiscRegNoEffect(MISCREG_VSSTATUS, ... , tid) so the SD bit is derived
and stored consistently from the VSSTATUS32 view.
src/cpu/o3/commit.cc-1214-1241 (1)

1214-1241: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep total retirement bounded by commitWidth.

This now computes a local quota of up to commitWidth for each active thread and then sums them into commit_width. With 2 runnable threads, commit can retire up to 2 * commitWidth in one cycle, which breaks the stage bandwidth contract and also skews commitEligibleSamples.

Suggested fix
-    int commit_width = 0;
+    unsigned commit_width = 0;
     for (ThreadID tid : *activeThreads) {
-        commit_width_per_thread[tid] =
-            rob->countInstsOfGroups(tid, commitWidth);
+        const unsigned remaining = commitWidth - commit_width;
+        commit_width_per_thread[tid] =
+            rob->countInstsOfGroups(tid, remaining);
         commit_width += commit_width_per_thread[tid];
     }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cpu/o3/commit.cc` around lines 1214 - 1241, The code currently sums
per-thread quotas into commit_width allowing up to N * commitWidth retirements;
instead enforce a global cap of commitWidth by adjusting commit_width_per_thread
after computing per-thread counts: compute each thread's local quota via
rob->countInstsOfGroups(tid, commitWidth), then if the sum of those quotas
exceeds commitWidth, reduce quotas (e.g., iterate threads and decrement
per-thread quotas or allocate quotas round-robin/proportionally) until the total
<= commitWidth, set commit_width to that total, and keep the rest of the commit
loop using num_committed and num_committed_per_thread as before so the outer
while(num_committed < commit_width && ...) never allows more than commitWidth
retirements; update any bookkeeping that relies on commit_width or
commitEligibleSamples accordingly.
src/cpu/o3/decode.cc-188-196 (1)

188-196: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Size the SMT stats with the configured thread count.

These vectors are indexed by tid in tick()/decodeInsts(), but they are always initialized to 4 entries. Any config with more than 4 hardware threads will index past the registered stats size.

Suggested fix
-    smtidleCycles
-            .init(4)
+    smtidleCycles
+            .init(cpu->numThreads)
             .flags(statistics::total);
     smtblockedCycles
-            .init(4)
+            .init(cpu->numThreads)
             .flags(statistics::total);    
     smtnotactiveCycles
-            .init(4)
+            .init(cpu->numThreads)
             .flags(statistics::total);          
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cpu/o3/decode.cc` around lines 188 - 196, smtidleCycles,
smtblockedCycles, and smtnotactiveCycles are sized to a hardcoded 4 but are
indexed by tid in tick()/decodeInsts(), causing OOB for configs with >4 hardware
threads; change their .init(4) calls to initialize with the configured SMT
thread count (use the CPU/decoder's SMT thread count accessor—e.g.,
numThreads(), threads(), or threadContexts().size() used elsewhere in this
module) so the stats vector length matches the actual number of hardware
threads.
src/cpu/o3/fetch.hh-249-251 (1)

249-251: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Initialize the scheduler knobs from params instead of defaults here.

initDecodeScheduler() now branches on these members, but in the provided code smtDecodePolicy is effectively hard-wired to "multi_priority" and delayedSchedulerDelay has no deterministic value. That means the new fetch scheduler is not actually configurable, and the "delayed" path will read garbage if it ever becomes reachable.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cpu/o3/fetch.hh` around lines 249 - 251, smtDecodePolicy and
delayedSchedulerDelay are not initialized from the configuration and are using a
hardwired default/garbage; update their initialization to read from the Params
(e.g., the Fetch/CPU params) in the class constructor or member initializer list
so initDecodeScheduler() sees the configured values — locate the member
declarations smtDecodePolicy and delayedSchedulerDelay in fetch.hh and set them
from the appropriate Params fields (the same names used in the param
definition), or call the param accessor in the constructor before
initDecodeScheduler() is invoked.
src/cpu/o3/fetch.cc-1183-1194 (1)

1183-1194: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Free late I-cache responses after squash.

Dropping queued retry packets here does not cover requests that were already accepted by the I-cache. Once cacheReq.reset() runs, those late responses hit the unmatched-request path in processMultiCacheLineCompletion() and return without deleting pkt, so repeated squash/retry traffic leaks Packet objects.

Possible fix
@@
     bool found_packet = threads[tid].cacheReq.markCompletedAndStorePacket(pkt);
     if (!found_packet) {
+        delete pkt;
+        ++fetchStats.icacheSquashes;
         DPRINTF(Fetch, "[tid:%i] Packet doesn't match current requests, deleting pkt %#lx\n",
                 tid, pkt->getAddr());
         DPRINTF(Fetch, "[tid:%i] Expected requests: ", tid);
         for (size_t i = 0; i < threads[tid].cacheReq.requests.size(); i++) {
             DPRINTF(Fetch, "req[%d]=0x%lx ", i, threads[tid].cacheReq.requests[i]->getVaddr());
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cpu/o3/fetch.cc` around lines 1183 - 1194, The squash path currently
deletes queued retryPkt entries but leaves Packets already accepted by the
I-cache dangling (they hit the unmatched-request branch in
processMultiCacheLineCompletion() and are never deleted), so after
cacheReq.reset() ensure those late I-cache responses are freed: either (a) on
squash walk any container tracking outstanding accepted-cache Packets and
delete/erase them (similar to the retryPkt loop), or (b) update
processMultiCacheLineCompletion() to detect a squashed/unmatched response (no
cacheReq or thread mismatch) and delete the incoming Packet before returning;
reference symbols: retryPkt, cacheReq.reset(),
processMultiCacheLineCompletion(), Packet, cacheBlocked, cpu->contextToThread(),
tid.
src/cpu/o3/iew.cc-1611-1614 (1)

1611-1614: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Clear resolvedCFIs for every thread before repopulating it.

Fetch now consumes fromIEW->iewInfo[tid].resolvedCFIs per thread, but this function still clears that vector only for *activeThreads->begin() a few lines below. Any other active thread retains last cycle's resolved CFIs and fetch will keep re-enqueueing stale resolve updates for it.

Suggested change
     while (threads != end) {
         ThreadID tid = *threads++;
         fetchRedirect[tid] = false;
         toFetch->iewInfo[tid].ldstqCount = ldstQueue.getCount(tid);
         toFetch->iewInfo[tid].robCount = rob->getThreadEntries(tid);
         toFetch->iewInfo[tid].iqCount = scheduler->getIQInsts(tid);
+        toFetch->iewInfo[tid].resolvedCFIs.clear();
     }
-
-    // Clear resolvedFSQId and resolvedInstPC since they are already handled in frontend
-    ThreadID tid = *activeThreads->begin();
-    toFetch->iewInfo[tid].resolvedCFIs.clear();
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cpu/o3/iew.cc` around lines 1611 - 1614, The code only clears
resolvedCFIs for *activeThreads->begin(), leaving other threads'
fromIEW->iewInfo[tid].resolvedCFIs stale; fix by iterating all tids in
activeThreads and calling resolvedCFIs.clear() (or equivalent) on each
fromIEW->iewInfo[tid] (or toFetch->iewInfo[tid] as appropriate) before
repopulating counts; place this loop alongside where you set
ldstqCount/robCount/iqCount so every active thread's resolvedCFIs is reset each
cycle.
src/cpu/o3/lsq.cc-1954-1958 (1)

1954-1958: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't collapse independent-mode stall state into a global stall.

With SMTLSQMode::Independent, lsqPolicy can still be Dynamic, so this makes one thread's replay stall the other. The global path should only apply in shared mode.

Suggested fix
-    if (lsqPolicy == SMTQueuePolicy::Dynamic ||
-        lsqPolicy == SMTQueuePolicy::DynamicBorrowing)
+    if (sharedLSQMode() &&
+        (lsqPolicy == SMTQueuePolicy::Dynamic ||
+         lsqPolicy == SMTQueuePolicy::DynamicBorrowing))
         return isStalled();
     else
         return thread[tid].isStalled();
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cpu/o3/lsq.cc` around lines 1954 - 1958, The current branch returns the
global isStalled() whenever lsqPolicy is Dynamic/DynamicBorrowing, which
incorrectly collapses per-thread stalls when SMTLSQMode is Independent; update
the condition to only return global isStalled() when lsqPolicy is Dynamic or
DynamicBorrowing AND the LSQ mode is not SMTLSQMode::Independent (otherwise
return thread[tid].isStalled()); refer to lsqPolicy, SMTQueuePolicy::Dynamic,
SMTQueuePolicy::DynamicBorrowing, SMTLSQMode::Independent, isStalled(), and
thread[tid].isStalled() when making the change.
src/cpu/o3/lsq_unit.cc-845-857 (1)

845-857: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate the queue slot before returning the “current” request.

These helpers only check lqIdx/sqIdx >= 0, but replay/squash/commit can leave a stale index that now points at a recycled LSQ slot. That can return another instruction's request and reintroduce the exact stale-request bug this change is trying to remove. Guard with isValidIdx(), valid(), and a seqNum match before handing the pointer out.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cpu/o3/lsq_unit.cc` around lines 845 - 857, currentLoadRequest and
currentStoreRequest currently only check lqIdx/sqIdx >= 0 and can return a
recycled slot; update both to validate the indexed slot before returning: for
currentLoadRequest check inst && inst->lqIdx >= 0 && isValidIdx(inst->lqIdx) &&
loadQueue[inst->lqIdx].valid() && loadQueue[inst->lqIdx].seqNum == inst->seqNum
(then return request()), otherwise return nullptr; do the analogous checks in
currentStoreRequest using sqIdx, storeQueue, and the slot's seqNum to ensure the
slot still belongs to this instruction. Ensure you reference the
loadQueue/storeQueue slot validation helpers (isValidIdx, valid) and compare
seqNum on the slot against inst->seqNum.
src/cpu/o3/lsq_unit.cc-1076-1077 (1)

1076-1077: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

replay_executed_loads is ignored, so visible atomic/sync stores never replay completed loads.

completeStore() computes replay_executed_loads specifically for atomic and sync-visible stores, but this helper immediately marks it unused and always skips ld_inst->isExecuted(). That drops the path meant to invalidate already-issued conflicting loads after another hart makes a store visible.

Also applies to: 1107-1123

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cpu/o3/lsq_unit.cc` around lines 1076 - 1077, The code currently marks
replay_executed_loads and visible_store_seq as [[maybe_unused]] and always skips
executed loads (using ld_inst->isExecuted()), which prevents atomic/sync-visible
stores from replaying already-executed conflicting loads; update completeStore()
to stop ignoring replay_executed_loads (remove the [[maybe_unused]]), and modify
the load-replay condition where ld_inst->isExecuted() is checked so that
executed loads are considered for replay when replay_executed_loads is true
(e.g., only skip executed loads if replay_executed_loads is false); apply the
same fix to the corresponding load-handling code paths noted around the
1107–1123 region so visible_store_seq and replay_executed_loads drive replay
behavior correctly.
src/cpu/o3/rename.cc-430-432 (1)

430-432: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't index smtStallEvents with raw StallReason.

smtStallEvents is sized by StallEventCount, and the code above already maps StallReason into that smaller enum. Indexing it directly with stallSig->renameBlockReason[tid] can hit the wrong bucket—or walk past the vector for reasons like OtherFragStall or bandwidth-specific stalls.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cpu/o3/rename.cc` around lines 430 - 432, The code indexes
stats.smtStallEvents with a raw StallReason (stallSig->renameBlockReason[tid])
which can be out-of-range because smtStallEvents is sized by StallEventCount;
instead map the StallReason into the reduced StallEvent enum before indexing. In
the rename.cc block around the if (stallSig->blockRename[tid]) branch, replace
direct use of stallSig->renameBlockReason[tid] when calling setAllStalls and
stats.smtStallEvents[...] with the mapped event value (the function or mapping
that converts StallReason → StallEventCount-based enum), then use that mapped
enum to call setAllStalls and stats.smtStallEvents[mappedEvent].sample(tid).
src/cpu/o3/lsq_unit.cc-365-368 (1)

365-368: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't assert response size against inst->effSize for split loads.

For split loads, completeDataAccess() runs once per fragment, so pkt->getSize() can be smaller than the full instruction size. This will fire as soon as a split load in pmem goes through the multi-context difftest path; compare against the fragment size instead, or gate the equality check on !request->isSplit().

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cpu/o3/lsq_unit.cc` around lines 365 - 368, The assert comparing
pkt->getSize() to inst->effSize is invalid for split loads; in
completeDataAccess() replace the unconditional assert(size == inst->effSize)
with a check that uses the fragment size for split requests (e.g. use
request->getSize() when request->isSplit() is true) or gate the equality on
!request->isSplit(); i.e. ensure you compare size to request->getSize() for
fragments or to inst->effSize only for non-split requests (refer to
pkt->getSize(), inst->effSize, request->isSplit(), and request->getSize()).
src/cpu/o3/rob.hh-233-239 (1)

233-239: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

isFull(tid) is checking total ROB size instead of the thread limit.

The non-DynamicBorrowing path compares threadGroups[tid].size() against numEntries, but partitioned and threshold modes cap each thread at maxEntries[tid]. That will report "not full" long after the thread has exhausted its own quota, so upstream stages can over-issue until a later assertion fires.

Suggested fix
-      return threadGroups[tid].size() == numEntries;
+      return threadGroups[tid].size() >= maxEntries[tid];
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cpu/o3/rob.hh` around lines 233 - 239, isFull(ThreadID tid) currently
returns threadGroups[tid].size() == numEntries for non-DynamicBorrowing modes
which checks against the total ROB size; change it to check the per-thread limit
by comparing threadGroups[tid].size() (or equivalently numFreeEntries(tid))
against maxEntries[tid] so partitioned/threshold modes enforce each thread's
quota; keep the existing DynamicBorrowing branch (robPolicy ==
SMTQueuePolicy::DynamicBorrowing) that uses numFreeEntries(tid) == 0 and replace
the fallback comparison to use maxEntries[tid] instead of numEntries.
src/cpu/pred/btb/btb_mgsc.cc-891-895 (1)

891-895: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Train MGSC weights with the same ASID-aware key used for lookup.

findWeight() now reads from a PC+ASID bucket, but the training path still ignores that key: weightTableIdx is computed here and then updateWeightTable() re-hashes by PC without ASID. Once asidHash != 0, prediction and training hit different entries, so the active address space never converges its weights.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cpu/pred/btb/btb_mgsc.cc` around lines 891 - 895, The training path
computes a PC-only index but should use the same PC+ASID key used for lookup;
change the code that computes weightTableIdx in the training branch to call
getPcIndex(stream.startPC, weightTableIdxWidth, stream.asidHash) (same
parameters used by findWeight()) and pass that index into updateWeightTable (or
change updateWeightTable to accept an externally computed index) so
updateWeightTable does not re-hash by PC alone; ensure all calls to
updateWeightTable from this training path use the ASID-aware index so prediction
and training access the same bucket when stream.asidHash != 0.
src/cpu/pred/btb/abtb.cc-398-438 (1)

398-438: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep the ahead-pipeline queue keyed to the live BTB set.

This path still snapshots a full BTBSet into aheadReadBtbEntries. When the queued lookup is consumed, hit handling updates the copied entries while the real set in btb[idx_prvStartpc] keeps its old recency state, so normal hits stop refreshing the live replacement metadata. Please queue only the prior block/set identity and re-read the live set when the ahead stage matures.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cpu/pred/btb/abtb.cc` around lines 398 - 438, The ahead-pipeline
currently snapshots a full BTBSet into state.aheadReadBtbEntries in
AheadBTB::lookupSingleBlock which causes stale recency metadata; instead push
only the identifying info (e.g., prior block PC and its btb index/ASID hash)
into state.aheadReadBtbEntries and when the entry is popped re-read the live BTB
set from btb[idx_prvStartpc] (or via getIndex/getTag as appropriate) before
doing tag compares and hit handling; update uses of std::tie(pc, idx_prvStartpc,
set) so the popped tuple contains no BTBSet, then fetch BTBSet set =
btb[idx_prvStartpc] (or equivalent) before iterating/printing or updating
entries (affecting functions/variables: AheadBTB::lookupSingleBlock,
state.aheadReadBtbEntries, btb, idx_prvStartpc, getIndex/getTag,
aheadPipelinedStages, printTickedBTBEntry).
src/cpu/pred/btb/btb_tage_ub.cc-325-327 (1)

325-327: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Make the upper-bound predictor state truly thread-local.

Discarding tid here leaves the new thread-aware API as a no-op, while this class still keeps a single exactPathHistory. In HistorySource::PathHash mode, different SMT threads will overwrite each other’s speculative path history, so update/recovery can run against the wrong context.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cpu/pred/btb/btb_tage_ub.cc` around lines 325 - 327,
BTBTAGEUpperBound::getPredictionMeta currently ignores the ThreadID and shares a
single exactPathHistory across threads, causing SMT threads to overwrite each
other's speculative path history (HistorySource::PathHash mode); make the
upper-bound predictor state truly thread-local by storing per-thread
exactPathHistory (e.g., a vector/array or thread-local map keyed by ThreadID)
and change BTBTAGEUpperBound::getPredictionMeta to index into that per-thread
slot using tid so updates and recoveries use the correct context for each
thread.
src/cpu/pred/btb/decoupled_bpred.cc-567-575 (1)

567-575: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Missing-target squash only rewinds the PC, not the thread history.

This path clears the FTQ and redirects s0PC, but it leaves the thread’s speculative histories and historyManagers[tid] untouched. The next prediction on that thread will reuse stale history, and the folded-history predictors will diverge from the redirected control flow.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cpu/pred/btb/decoupled_bpred.cc` around lines 567 - 575, The
squash-on-missing-target path rewinds only FTQ and s0PC but leaves per-thread
speculative histories intact, so update the handler (after ftq.clear(tid);
clearPreds(tid); threads[tid].validprediction = false; threads[tid].s0PC =
redirect_pc;) to also rewind/clear the thread’s speculative history and
folded-history manager state for tid: invoke the appropriate reset/clear/squash
API on historyManagers[tid] and any per-thread speculative-history structures so
the folded-history predictors are synchronized with the new redirect_pc and
cannot reuse stale history.
src/cpu/pred/btb/mbtb.cc-302-305 (1)

302-305: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep victim-cache identity ASID-aware on the update path too.

These changes make MBTB lookup/tagging ASID-aware, but the victim-cache write/dedup path still matches entries by pc only. That lets same-PC branches from different ASIDs overwrite or erase each other in the victim cache, so a later update can corrupt another address space's predictor state.

Suggested direction
-        if (vc_entry.valid && vc_entry.pc == entry.pc) {    // pc is tag compared
+        const Addr entry_tag = getTag(entry.pc, stream.asidHash);
+        if (vc_entry.valid && vc_entry.pc == entry.pc &&
+            vc_entry.tag == entry_tag) {

Apply the same ASID-aware keying in insertVictimCache() and eraseFromVictimCacheByPC() so lookup, update, insert, and dedup all agree on what constitutes the same victim-cache entry.

Also applies to: 464-465, 512-568, 727-740

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cpu/pred/btb/mbtb.cc` around lines 302 - 305, The victim-cache
update/dedup path is currently matching entries by pc only, which allows entries
from different ASIDs to collide; update insertVictimCache() and
eraseFromVictimCacheByPC() to use the same ASID-aware keying as lookup() (use
the asidHash computed from stagePreds.front().asidHash or the same metadata
path) so all victim-cache operations compare both pc and asidHash when matching
entries; ensure any dedup/erase logic uses the combined (pc, asidHash) identity
so victim entries for different ASIDs cannot overwrite or be erased by each
other.
src/cpu/valuepred/enhanced_stride.cc-125-128 (1)

125-128: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix inflightWindowTagLength validation and 64-bit-safe mask shifts to avoid UB

  • gem5_assert(params.inflightWindowTagLength, ...) only checks non-zero; negative values pass—assert params.inflightWindowTagLength > 0.
  • enhanced_stride.cc builds masks with expressions like ((1 << targetLength) - 1) / ((1 << tagWidth) - 1). With ValuePredictor.py defaulting inflightWindowTagLength = Param.Int(64, ...), shifting by 64 with a 32-bit 1 can trigger undefined behavior—cap the parameter (and/or use a uint64_t-based mask with a len >= 64 fallback).
Suggested fix
@@
-    gem5_assert(params.inflightWindowTagLength, "EStride inflightWindowTagLength must > 0 \n");
+    gem5_assert(params.inflightWindowTagLength > 0,
+                "EStride inflightWindowTagLength must > 0 \n");
+    gem5_assert(params.inflightWindowTagLength <= 64,
+                "EStride inflightWindowTagLength must <= 64 \n");

Update the mask generation to use uint64_t(1) (and handle len >= 64 as ~uint64_t(0)), so ((1ULL << len) - 1) is never evaluated with len == 64.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cpu/valuepred/enhanced_stride.cc` around lines 125 - 128, The assert for
the inflight window tag length is too weak and the mask math uses 32-bit shifts
causing UB for 64; change the check that validates
params.inflightWindowTagLength to require > 0 (not just non-zero) and update
mask construction in enhanced_stride.cc to use uint64_t (e.g., uint64_t(1) or
1ULL) and handle len >= 64 by returning ~uint64_t(0) (all-ones) instead of
performing (1ULL << 64) - 1; ensure every use of inflightWindowTagLength or
tagWidth that builds a mask follows this uint64_t-safe pattern so shifts never
invoke UB.
src/mem/physical.cc-855-857 (1)

855-857: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Restore loop may leave stale bytes for zero words

At Line 855, the new read_data != 0 gate skips writing zero words, so previously non-zero destination bytes can survive restore. This can corrupt reconstructed memory when backing storage isn’t guaranteed zeroed.

💡 Suggested fix
             for (uint64_t x = 0; x < output.pos; x += sizeof(long)) {
                 uint64_t read_data = *(decompress_file_buffer + x / sizeof(long));
-                if (read_data != 0) {
-                    pmem_current = (uint64_t*)(pmem + total_write_size + x);
+                pmem_current = (uint64_t*)(pmem + total_write_size + x);
+                if (read_data != 0 || *pmem_current != 0) {
                     *pmem_current = read_data;
-                    non_zero_dword++;
+                    if (read_data != 0) {
+                        non_zero_dword++;
+                    }
                 }
             }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/mem/physical.cc` around lines 855 - 857, The restore loop currently skips
writing when read_data == 0 which can leave previously non-zero bytes in pmem;
change the logic in the restore loop around variables read_data, pmem_current,
pmem, total_write_size and x so that you always write the word to the
destination (i.e., remove the read_data != 0 gate) or, if you keep the branch,
explicitly write zero for the zero case (e.g., *pmem_current = 0), ensuring no
stale bytes remain after restore.
🟡 Minor comments (4)
configs/common/Options.py-352-353 (1)

352-353: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix typo in --smt help text.

The help string has a typo on Line 353 (multitThread-supported), which shows up in CLI docs/help output.

Suggested fix
-    parser.add_argument("--smt", action="store_true", default=False,
-                        help=""" RISCV SMT support, which requires multitThread-supported gcpt restore and diff-ref-so""")
+    parser.add_argument("--smt", action="store_true", default=False,
+                        help="""RISC-V SMT support, which requires multi-thread-supported gcpt restore and diff-ref-so""")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@configs/common/Options.py` around lines 352 - 353, Update the help text for
the parser.add_argument("--smt", action="store_true", default=False, help=...)
entry to fix the typo "multitThread-supported" — change it to
"multithread-supported" (or "multi-thread-supported" if you prefer hyphenation)
so the help string reads e.g. "RISCV SMT support, which requires
multithread-supported gcpt restore and diff-ref-so".
src/cpu/o3/commit.cc-115-125 (1)

115-125: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use the current thread when dumping a stuck ROB head.

Line 115 still reads rob->readHeadInst(0) inside the per-thread loop, so a stall on thread tid != 0 reports thread 0's head or incorrectly says the ROB is empty.

Suggested fix
-                if (auto inst = rob->readHeadInst(0)) {
+                if (auto inst = rob->readHeadInst(tid)) {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cpu/o3/commit.cc` around lines 115 - 125, The code is calling
rob->readHeadInst(0) inside a per-thread loop which uses tid, causing dumps to
show thread 0's head; change the call to use the current thread variable (tid)
instead of 0 so readHeadInst(tid) retrieves the correct thread's head, and
likewise ensure getInstList(tid) and the debug_insts population/dump logic
remain consistent with tid when building and printing the stalled-ROB
instruction list (symbols: readHeadInst, rob, tid, getInstList, debug_insts).
src/cpu/pred/btb/btb_tage.cc-472-473 (1)

472-473: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Remove the unreachable empty-vector fallback.

Line 472 already asserts stagePreds is non-empty via predictorTid(stagePreds), so the stagePreds.empty() fallback on Line 473 can never run. If an empty vector is legal here, this still aborts; otherwise simplify the code to a single non-empty contract.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cpu/pred/btb/btb_tage.cc` around lines 472 - 473, The fallback that
checks stagePreds.empty() is unreachable because predictorTid(stagePreds)
already requires a non-empty vector; replace the ternary with a direct access to
stagePreds.front().asidHash (i.e., set asidHash := stagePreds.front().asidHash)
and remove the empty-vector branch. If empty stagePreds should be considered
legal, instead add an explicit guard/early return or assert before calling
predictorTid and compute asidHash only when non-empty; otherwise keep the
simplified non-empty contract and remove the redundant check.
src/cpu/valuepred/valuepred_unit.hh-30-30 (1)

30-30: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Rename assertValidTid to assert_valid_tid to match lower_snake_case.

assertValidTid appears in src/cpu/valuepred/valuepred_unit.hh and is defined in src/cpu/valuepred/valuepred_unit.cc, with call sites in src/cpu/valuepred/ideal_constant_lvp.cc and src/cpu/valuepred/enhanced_stride.cc; update the declaration, definition, and all call sites to assert_valid_tid.

Suggested rename
-    void assertValidTid(ThreadID tid) const;
+    void assert_valid_tid(ThreadID tid) const;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cpu/valuepred/valuepred_unit.hh` at line 30, The declaration and uses of
assertValidTid need to be renamed to lower_snake_case: change the function
declaration and definition from assertValidTid(...) to assert_valid_tid(...),
update its definition in valuepred_unit.cc, and update all call sites (e.g.,
calls in ideal_constant_lvp.cc and enhanced_stride.cc) to call assert_valid_tid;
keep the signature and semantics unchanged and run a build to catch any
remaining references.
🧹 Nitpick comments (2)
configs/example/smt_idealkmhv3.py (1)

12-13: ⚡ Quick win

Rename helper to snake_case for guideline compliance.

setSharedLSQParams should use lower_snake_case.

Suggested fix
-def setSharedLSQParams(args, system):
+def set_shared_lsq_params(args, system):
     setKmhV3IdealParams(args, system)
@@
-    setSharedLSQParams(args, test_sys)
+    set_shared_lsq_params(args, test_sys)

As per coding guidelines: "Functions and methods should use lower_snake_case naming convention".

Also applies to: 45-45

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@configs/example/smt_idealkmhv3.py` around lines 12 - 13, Rename the helper
function setSharedLSQParams to follow snake_case: set_shared_lsq_params; update
its definition and all references (e.g., callers that invoke setSharedLSQParams
and the helper call to setKmhV3IdealParams inside it) including the other
occurrence noted (line 45) so they call set_shared_lsq_params, and ensure any
imports or exports use the new name to keep behavior unchanged.
src/cpu/pred/btb/test/abtb.test.cc (1)

44-63: ⚡ Quick win

Use lower_snake_case for newly introduced helper functions.

makePrediction and clearAheadPipeline should follow the repository naming rule for functions/methods.

Suggested rename
-FullBTBPrediction makePrediction(Addr startPC, AheadBTB *abtb,
-                                 ThreadID tid = 0, uint8_t asidHash = 0) {
+FullBTBPrediction make_prediction(Addr start_pc, AheadBTB *abtb,
+                                  ThreadID tid = 0, uint8_t asid_hash = 0) {
@@
-void clearAheadPipeline(AheadBTB *abtb, ThreadID tid) {
+void clear_ahead_pipeline(AheadBTB *abtb, ThreadID tid) {

As per coding guidelines, "Functions and methods should use lower_snake_case naming convention".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cpu/pred/btb/test/abtb.test.cc` around lines 44 - 63, The helper
functions violate naming conventions—rename makePrediction and
clearAheadPipeline to lower_snake_case (e.g., make_prediction and
clear_ahead_pipeline) keeping their signatures the same (makePrediction(Addr
startPC, AheadBTB *abtb, ThreadID tid = 0, uint8_t asidHash = 0) and
clearAheadPipeline(AheadBTB *abtb, ThreadID tid)), then update all
references/call sites in this test file (and any headers) to the new names so
the code compiles and follows the repository's function naming rule.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a29d618b-a581-4652-ae35-442a8393b911

📥 Commits

Reviewing files that changed from the base of the PR and between 7930d36 and 8d75755.

📒 Files selected for processing (92)
  • .gitignore
  • configs/common/FSConfig.py
  • configs/common/Options.py
  • configs/common/xiangshan.py
  • configs/example/smt_idealkmhv3.py
  • src/arch/riscv/isa/vector/base/vector_mem.temp.isa
  • src/arch/riscv/isa/vector/simple/vector_mem.temp.isa
  • src/arch/riscv/tlb.cc
  • src/arch/riscv/tlb.hh
  • src/cpu/base.cc
  • src/cpu/base.hh
  • src/cpu/difftest.cc
  • src/cpu/difftest.hh
  • src/cpu/o3/BaseO3CPU.py
  • src/cpu/o3/FuncScheduler.py
  • src/cpu/o3/SConscript
  • src/cpu/o3/comm.hh
  • src/cpu/o3/commit.cc
  • src/cpu/o3/commit.hh
  • src/cpu/o3/cpu.cc
  • src/cpu/o3/cpu.hh
  • src/cpu/o3/decode.cc
  • src/cpu/o3/decode.hh
  • src/cpu/o3/dyn_inst.hh
  • src/cpu/o3/fetch.cc
  • src/cpu/o3/fetch.hh
  • src/cpu/o3/iew.cc
  • src/cpu/o3/iew.hh
  • src/cpu/o3/inst_queue.cc
  • src/cpu/o3/inst_queue.hh
  • src/cpu/o3/issue_queue.cc
  • src/cpu/o3/issue_queue.hh
  • src/cpu/o3/lsq.cc
  • src/cpu/o3/lsq.hh
  • src/cpu/o3/lsq_unit.cc
  • src/cpu/o3/lsq_unit.hh
  • src/cpu/o3/rename.cc
  • src/cpu/o3/rename.hh
  • src/cpu/o3/rob.cc
  • src/cpu/o3/rob.hh
  • src/cpu/o3/smt_sched.hh
  • src/cpu/pred/BranchPredictor.py
  • src/cpu/pred/SConscript
  • src/cpu/pred/btb/abtb.cc
  • src/cpu/pred/btb/abtb.hh
  • src/cpu/pred/btb/btb_ittage.cc
  • src/cpu/pred/btb/btb_ittage.hh
  • src/cpu/pred/btb/btb_mgsc.cc
  • src/cpu/pred/btb/btb_mgsc.hh
  • src/cpu/pred/btb/btb_tage.cc
  • src/cpu/pred/btb/btb_tage.hh
  • src/cpu/pred/btb/btb_tage_ub.cc
  • src/cpu/pred/btb/btb_tage_ub.hh
  • src/cpu/pred/btb/btb_ubtb.cc
  • src/cpu/pred/btb/btb_ubtb.hh
  • src/cpu/pred/btb/common.hh
  • src/cpu/pred/btb/decoupled_bpred.cc
  • src/cpu/pred/btb/decoupled_bpred.hh
  • src/cpu/pred/btb/ftq.cc
  • src/cpu/pred/btb/ftq.hh
  • src/cpu/pred/btb/mbtb.cc
  • src/cpu/pred/btb/mbtb.hh
  • src/cpu/pred/btb/microtage.cc
  • src/cpu/pred/btb/microtage.hh
  • src/cpu/pred/btb/ras.cc
  • src/cpu/pred/btb/ras.hh
  • src/cpu/pred/btb/test/abtb.test.cc
  • src/cpu/pred/btb/test/btb_tage.test.cc
  • src/cpu/pred/btb/timed_base_pred.hh
  • src/cpu/pred/btb/uras.cc
  • src/cpu/pred/btb/uras.hh
  • src/cpu/simple/base.cc
  • src/cpu/simple/base.hh
  • src/cpu/valuepred/ValuePredictor.py
  • src/cpu/valuepred/enhanced_stride.cc
  • src/cpu/valuepred/enhanced_stride.hh
  • src/cpu/valuepred/ideal_constant_lvp.cc
  • src/cpu/valuepred/ideal_constant_lvp.hh
  • src/cpu/valuepred/valuepred_metadata.hh
  • src/cpu/valuepred/valuepred_unit.cc
  • src/cpu/valuepred/valuepred_unit.hh
  • src/dev/riscv/HartCtrl.py
  • src/dev/riscv/SConscript
  • src/dev/riscv/hart_ctrl.cc
  • src/dev/riscv/hart_ctrl.hh
  • src/mem/cache/prefetch/base.cc
  • src/mem/cache/prefetch/base.hh
  • src/mem/cache/prefetch/queued.cc
  • src/mem/cache/prefetch/worker.cc
  • src/mem/physical.cc
  • src/sim/system.cc
  • src/sim/system.hh

Comment thread src/cpu/o3/commit.cc
Comment on lines +1248 to +1257
if (interrupt != NoFault) {
// If inside a transaction, postpone interrupts
if (executingHtmTransaction(commit_thread)) {
cpu->clearInterrupts(0);
toIEW->commitInfo[0].clearInterrupt = true;
interrupt = NoFault;
avoidQuiesceLiveLock = true;
} else {
handleInterrupt();
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Only the interrupt-handling thread should defer or consume interrupts.

Interrupts are still handled on thread 0, but this block clears interrupt when any commit_thread is inside HTM. If thread 1 is transactional while thread 0 has a pending interrupt, the interrupt is dropped before thread 0 ever processes it.

Suggested fix
-            if (interrupt != NoFault) {
-                // If inside a transaction, postpone interrupts
-                if (executingHtmTransaction(commit_thread)) {
+            if (commit_thread == 0 && interrupt != NoFault) {
+                // If inside a transaction, postpone interrupts
+                if (executingHtmTransaction(0)) {
                     cpu->clearInterrupts(0);
                     toIEW->commitInfo[0].clearInterrupt = true;
                     interrupt = NoFault;
                     avoidQuiesceLiveLock = true;
                 } else {
                     handleInterrupt();
                 }
             }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if (interrupt != NoFault) {
// If inside a transaction, postpone interrupts
if (executingHtmTransaction(commit_thread)) {
cpu->clearInterrupts(0);
toIEW->commitInfo[0].clearInterrupt = true;
interrupt = NoFault;
avoidQuiesceLiveLock = true;
} else {
handleInterrupt();
}
if (commit_thread == 0 && interrupt != NoFault) {
// If inside a transaction, postpone interrupts
if (executingHtmTransaction(0)) {
cpu->clearInterrupts(0);
toIEW->commitInfo[0].clearInterrupt = true;
interrupt = NoFault;
avoidQuiesceLiveLock = true;
} else {
handleInterrupt();
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cpu/o3/commit.cc` around lines 1248 - 1257, The code currently clears and
consumes the pending interrupt whenever any commit_thread is inside HTM
(executingHtmTransaction(commit_thread)), which can drop an interrupt destined
for thread 0; change the logic so only the interrupt-handling thread (thread 0)
may defer/consume the interrupt. Concretely, in the block using
executingHtmTransaction(commit_thread), add a check that commit_thread is the
interrupt-handling thread (i.e., commit_thread corresponds to thread 0) before
calling cpu->clearInterrupts(0), setting toIEW->commitInfo[0].clearInterrupt =
true, clearing interrupt = NoFault, and setting avoidQuiesceLiveLock; if
commit_thread is not thread 0, do not clear or consume the interrupt and instead
call handleInterrupt() so the correct thread processes it.

Comment thread src/cpu/o3/lsq.cc
Comment on lines +1033 to +1041
// A fence/flush only waits for the requesting thread's sbuffer domain.
for (ThreadID tid = 0; tid < numThreads; ++tid) {
if (!storeBufferFlushing(tid) ||
!storeBufferEmpty(tid, _storeBufferFlushBeforeSeq[tid])) {
continue;
}

clearStoreBufferFlushing(tid);
cpu->activityThisCycle();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Keep the flush armed until the matching SQ range is empty.

This now clears _storeBufferFlushing[tid] based only on the store-buffer domain. If the thread still has older committed stores waiting in SQ, a fence can observe "flush complete" before those stores ever offload.

Suggested fix
-    // A fence/flush only waits for the requesting thread's sbuffer domain.
+    // A fence/flush only completes once the requesting thread has no matching
+    // stores left in either SQ or the store buffer.
     for (ThreadID tid = 0; tid < numThreads; ++tid) {
-        if (!storeBufferFlushing(tid) ||
-            !storeBufferEmpty(tid, _storeBufferFlushBeforeSeq[tid])) {
+        const auto flush_before = _storeBufferFlushBeforeSeq[tid];
+        const bool pending_sq =
+            flush_before == static_cast<InstSeqNum>(-1) ?
+                hasStoresToWB(tid) :
+                hasStoresToWBBefore(tid, flush_before);
+        if (!storeBufferFlushing(tid) || pending_sq ||
+            !storeBufferEmpty(tid, flush_before)) {
             continue;
         }
 
         clearStoreBufferFlushing(tid);
         cpu->activityThisCycle();
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cpu/o3/lsq.cc` around lines 1033 - 1041, The flush flag
`_storeBufferFlushing[tid]` is being cleared as soon as the store-buffer domain
is empty, but we must keep it armed until the matching store-queue (SQ) range up
to `_storeBufferFlushBeforeSeq[tid]` is also drained; change the condition
around clearStoreBufferFlushing(tid) so it only clears when both
storeBufferEmpty(tid, _storeBufferFlushBeforeSeq[tid]) and the corresponding SQ
is empty (e.g. add a check like storeQueueEmpty(tid,
_storeBufferFlushBeforeSeq[tid]) or equivalent), leaving
`_storeBufferFlushBeforeSeq` and cpu->activityThisCycle() behavior unchanged and
thus preventing a fence from seeing "flush complete" before older committed
stores offload.

Comment thread src/cpu/o3/lsq.cc
Comment on lines +2160 to +2185
if (!entry->sending) {
return;
}

for (size_t i = 0; i < mask.size(); ++i) {
if (mask[i] &&
entry->byteGenerations[offset + i] != 0 &&
entry->byteGenerations[offset + i] <= generation) {
entry->validMask[offset + i] = false;
}
}
};

for (ThreadID other_tid = 0; other_tid < numThreads; ++other_tid) {
if (other_tid == tid) {
continue;
}

auto entry = storeBuffer.get(other_tid, block_paddr);
if (!entry) {
continue;
}

invalidate_entry(entry);
invalidate_entry(entry->vice);
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Invalidate stale vice bytes even before the vice starts sending.

The early !entry->sending return leaves older vice bytes intact. After the current entry drains and the vice is promoted, another thread can forward bytes that were already overwritten by a later visible store.

Suggested fix
     auto invalidate_entry = [&](StoreBufferEntry *entry) {
         if (!entry || offset + mask.size() > entry->validMask.size()) {
             return;
         }
-
-        if (!entry->sending) {
-            return;
-        }
 
         for (size_t i = 0; i < mask.size(); ++i) {
             if (mask[i] &&
                 entry->byteGenerations[offset + i] != 0 &&
                 entry->byteGenerations[offset + i] <= generation) {
                 entry->validMask[offset + i] = false;
+                entry->byteGenerations[offset + i] = 0;
             }
         }
     };
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cpu/o3/lsq.cc` around lines 2160 - 2185, The lambda invalidate_entry
currently returns early when entry->sending, which prevents clearing stale bytes
in the vice; change it so that it does not skip processing vice entries — either
remove the early "if (!entry->sending) return;" or alter the logic to only skip
invalidation for the primary entry but still iterate and clear bytes for
entry->vice (using entry->byteGenerations, entry->validMask, mask, offset, and
generation). Ensure you still guard null entry pointers (storeBuffer.get and
entry->vice) but always run the invalidation loop for vice entries so
overwritten bytes are cleared before the vice begins sending.

Comment thread src/cpu/o3/rob.cc
Comment on lines 457 to 461
ThreadID tid = inst->threadNumber;
assert(canAllocate(tid, 1));

// allocate group
bool alloc = (this->*allocateNewGroup)(inst, tid);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Only charge ROB capacity when a new group is actually allocated.

canAllocate(tid, 1) runs before allocateNewGroup() decides whether this instruction consumes another ROB entry. Under compressed-group policies, an instruction that extends the current group should still be accepted even when the thread is at its group limit; this assert turns that legal case into a false full condition.

Suggested fix
-    ThreadID tid = inst->threadNumber;
-    assert(canAllocate(tid, 1));
-
-    // allocate group
-    bool alloc = (this->*allocateNewGroup)(inst, tid);
+    ThreadID tid = inst->threadNumber;
+
+    // allocate group
+    bool alloc = (this->*allocateNewGroup)(inst, tid);
+    assert(!alloc || canAllocate(tid, 1));
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
ThreadID tid = inst->threadNumber;
assert(canAllocate(tid, 1));
// allocate group
bool alloc = (this->*allocateNewGroup)(inst, tid);
ThreadID tid = inst->threadNumber;
// allocate group
bool alloc = (this->*allocateNewGroup)(inst, tid);
assert(!alloc || canAllocate(tid, 1));
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cpu/o3/rob.cc` around lines 457 - 461, The assert calling
canAllocate(tid, 1) is executed before allocateNewGroup() decides if a new ROB
entry is needed, which blocks legal compressed-group extensions; remove or defer
that check and only charge/assert when a new group is actually allocated: call
(this->*allocateNewGroup)(inst, tid) first (or inspect its return value `alloc`)
and then if alloc is true perform canAllocate(tid, 1) / assert; reference the
ThreadID tid from inst->threadNumber, the allocateNewGroup call and the boolean
`alloc` to gate the capacity assertion.

HartCtrl::read(PacketPtr pkt)
{
assert(pkt->getAddr() >= pioAddr && pkt->getAddr() < pioAddr + pioSize);
assert(pkt->getSize() > 0 && pkt->getSize() <= sizeof(uint64_t));
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify Packet typed accessor contract and current HartCtrl usage.
set -euo pipefail

echo "[1] Locate Packet accessor declarations/definitions"
fd -i 'packet_access.hh'
fd -i 'packet.hh'

echo "[2] Inspect setLE/getLE and any size assertions"
rg -n -C3 'setLE<|getLE<|setRaw<|getRaw<|getSize\(\)' src mem include

echo "[3] Inspect HartCtrl read/write size handling"
rg -n -C5 'HartCtrl::read|HartCtrl::write|setLE\(|getLE<' src/dev/riscv/hart_ctrl.cc

Repository: OpenXiangShan/GEM5

Length of output: 50437


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "## hart_ctrl.cc: show lines 1-120 (includes referenced line 24 and ~36-37)"
nl -ba src/dev/riscv/hart_ctrl.cc | sed -n '1,120p'

echo "## hart_ctrl.cc: show lines around any setLE usage"
rg -n "setLE\(|setLE<|getLE\(|setRaw<|setRaw\(" src/dev/riscv/hart_ctrl.cc
echo
for n in 24 36 37; do
  echo "## context around line $n"
  nl -ba src/dev/riscv/hart_ctrl.cc | sed -n "$((n-10)),$((n+20))p"
  echo
done

echo "## packet_access.hh: locate setLE/getLE/setRaw definitions"
rg -n "template <typename T>.*setLE|setLE<|setLE\(|setRaw<|getRaw<|getLE<|Packet::setLE" src/mem/packet_access.hh src/mem/packet_access* || true
echo

echo "## packet_access.hh: print section covering setLE (best-effort window)"
# print a reasonable window around template setLE if rg found a line; otherwise print around where setLE likely is
nl -ba src/mem/packet_access.hh | sed -n '1,220p'

Repository: OpenXiangShan/GEM5

Length of output: 198


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "## hart_ctrl.cc: show lines 1-130 (with line numbers)"
awk '{printf "%5d\t%s\n", NR, $0}' src/dev/riscv/hart_ctrl.cc | sed -n '1,130p'

echo
echo "## hart_ctrl.cc: show setLE/getLE occurrences"
rg -n "setLE<|setLE\(|getLE<|getLE\(|setRaw<|setRaw\(|getRaw<|getRaw\(" src/dev/riscv/hart_ctrl.cc || true

echo
echo "## hart_ctrl.cc: context around any setLE calls"
# Grab some wider contexts around each match line (max 30 lines around)
rg -n "setLE<|setLE\(" src/dev/riscv/hart_ctrl.cc || true

echo
echo "## packet_access.hh: locate setLE/getLE/setRaw/getRaw templates"
rg -n "template <typename T>.*setLE|setLE<|Packet::setLE|template <typename T>.*getLE|Packet::getLE|getRaw|setRaw" src/mem/packet_access.hh src/mem/packet.hh || true

echo
echo "## packet_access.hh: print first 260 lines (includes setLE/setRaw likely)"
awk '{printf "%5d\t%s\n", NR, $0}' src/mem/packet_access.hh | sed -n '1,260p'

Repository: OpenXiangShan/GEM5

Length of output: 9485


Handle sub-word MMIO reads explicitly.

HartCtrl::read() allows pkt->getSize() of 1/2/4/8 bytes, but it always calls pkt->setLE(hartResetState[tid]) (template-deduced as uint64_t), and Packet::setRaw<T> asserts sizeof(T) <= pkt->getSize()—so 1/2/4-byte reads will trip the assertion (or corrupt memory if asserts are compiled out). Dispatch on pkt->getSize() like write() and serialize the matching width.

💡 Proposed fix
-    pkt->setLE(hartResetState[tid]);
+    const uint64_t value = hartResetState[tid];
+    switch (pkt->getSize()) {
+      case sizeof(uint8_t):
+        pkt->setLE<uint8_t>(static_cast<uint8_t>(value));
+        break;
+      case sizeof(uint16_t):
+        pkt->setLE<uint16_t>(static_cast<uint16_t>(value));
+        break;
+      case sizeof(uint32_t):
+        pkt->setLE<uint32_t>(static_cast<uint32_t>(value));
+        break;
+      case sizeof(uint64_t):
+        pkt->setLE<uint64_t>(value);
+        break;
+      default:
+        panic("Unsupported HartCtrl read size %u\n", pkt->getSize());
+    }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/dev/riscv/hart_ctrl.cc` at line 24, HartCtrl::read() currently always
calls pkt->setLE(hartResetState[tid]) as a uint64_t which will fail for
1/2/4-byte MMIO reads; update HartCtrl::read to switch on pkt->getSize()
(1,2,4,8) and call pkt->setLE with the matching-width type (uint8_t, uint16_t,
uint32_t, uint64_t) or use the appropriate Packet::setRaw<T>/serialize path for
each width, keeping the existing size assert and preserving endianness handling
so sub-word reads are correctly serialized without triggering Packet assertions
or corrupting memory.

Change-Id: If3999d5c26bf325cfb052b100456e759388c2330
@github-actions
Copy link
Copy Markdown

🚀 Coremark Smoke Test Results

Branch IPC Change
Base (xs-dev) 2.3130 -
This PR 2.2842 📉 -0.0288 (-1.25%)

✅ Difftest smoke test passed!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants