Kmh smt rebase#863
Conversation
Co-authored-by: mohaonan <mo.haonan1@sanechips.com.cn>
Change-Id: I7690e69545b01ca4a8ba3e751f6cab7665f8767e
Change-Id: Ifbeb947f5f27ebdc9dc39dcfe0172eaa308f8e6f
Change-Id: I146d1ac20d06015e98713f30bae71fef3f5d7bcf
Change-Id: Icc05a7320ee5bf1495ef98694c3e92847613d79e
Co-authored-by: mo haonan <mohaonan@node024.bosccluster.com>
Change-Id: Ic416b537a4f2c87059c92c7b5be81618b1898e22
Change-Id: Ieb232b296c7c99ea216c14c23f135e6e081870a6
Change-Id: I11b460b6a6554998d052a020a02d84eb2b0664ad
Change-Id: Ice8e66d841a40e8c8420bd4756237eb9399d1642
Use per-thread committed instruction counts for O3 warmup and stat-dump stop checks in SMT mode, instead of summing instructions across threads. Change-Id: I6ecd5f96a18ce9aa96d0712a9e05f3d8dedcbac4
Co-authored-by: mo haonan <mohaonan@node023.bosccluster.com>
Change-Id: Idea57378fbe47dedd654f168141d0284faee716f
Change-Id: Ic49b3e7ab82a32b81427026ce8d185e7ebeaba76
Change-Id: I4ee24d52f8ffe8fdc0a2086d97bcbc9be10860cb
Co-authored-by: mo haonan <mohaonan@node023.bosccluster.com>
Change-Id: I653f16b0adefdfcc978f54791e21adbf74ecd84e
Change-Id: I8060ec785799940c4190fdba32b349226f44795d
Change-Id: Ia8509f4290de8b2c7e337892506c69a8d399d493
…buffer resource allocation mechanism; 3. Vectorize waitForVsetvl (#844) Co-authored-by: mo haonan <mohaonan@node023.bosccluster.com>
Change-Id: I30d0a510565c278349b4d4b915e84ead21f8c6ae
Change-Id: Id6528ad1ec2b2ad7a26cd0ef18030b8f671c3302
Change-Id: Ica716ed754083ab470c1eace992837b454547b55
Carry the triggering request context through prefetch metadata and final HardPFReq creation. Use context-aware prefetch queue matching so SMT threads with the same virtual address do not squash or deduplicate each other's prefetches. Validation: - scons build/RISCV/gem5.opt -j16 - git diff --check Change-Id: I6a40826b47bff2e7ee8c7748eecd8622b44ca3c3
Use per-instruction dispatch age for IQ ordering and remove the old SMT-specific selector plumbing. Change-Id: I06f4cec1fcbe910652272cf1caf9366bdbbea171
…est priority. (#847) Co-authored-by: mo haonan <mohaonan@node023.bosccluster.com>
Change-Id: I1078cf7423248c619faeb4fde7f0a210a7d02b77
Change-Id: I735b9557f8f8d69f094121bb6229bd3ab7682f49
Change-Id: Ie6f7cfc4bd2d2d8f7aad6c9e7bace19534b77c00
Change-Id: Iccef736835125a5ef35efb06c3eeced41f189b6c
…adlock (#858) Co-authored-by: mo haonan <mohaonan@node023.bosccluster.com>
Change-Id: I615bad73ac28ceea5e51bd1f594f94acda1f7705
Change-Id: I50ab39ce30eebfb6d129c2c0fafd8f855b536730
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughAdds FS-SMT support across configs, per-thread difftest and golden-memory, per-thread O3 pipeline/front-end and backend changes, shared LSQ/storebuffer with generation/visibility, ASID-aware branch predictors, HartCtrl MMIO device, vector-load agnostic fills, and prefetch context propagation. ChangesFS-SMT Enablement and Per-Thread Refactors
Sequence Diagram(s)sequenceDiagram
participant Fetch
participant Predictor
participant FTQ
participant Decode
participant LSQ
Fetch->>Predictor: putPCHistory(tid, asidHash)
Predictor-->>Fetch: stagePreds[tid] with metas
Fetch->>FTQ: enqueue targets (shared/partitioned FTQ)
FTQ-->>Decode: dequeue per-thread entries
Decode->>LSQ: load/store requests (per-thread)
LSQ-->>Decode: backpressure / store-forwarding info
Estimated code review effort Possibly related PRs
Suggested labels Suggested reviewers
Poem
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
|
There was a problem hiding this comment.
Actionable comments posted: 5
Note
Due to the large number of review comments, Critical severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/cpu/base.cc (1)
743-749:⚠️ Potential issue | 🟠 Major | ⚡ Quick winPreserve the new per-thread difftest replay state on CPU handover.
takeOverFrom()only transfersdiffAllStates, but the new reconciliation state lives inrecentCommittedStoresandsyncVisibleStoreReplayArmed. After a CPU switch, the replacement core loses the committed-store history and replay arm bits that the new load/store mismatch recovery depends on, so difftest can diverge immediately after handover.Suggested fix
if (enable_diff) { warn("Take over difftest state to new CPU\n"); enableDifftest = enable_diff; takeOverDiffAllStates(std::move(diff_all)); + recentCommittedStores = std::move(oldCPU->recentCommittedStores); + syncVisibleStoreReplayArmed = + std::move(oldCPU->syncVisibleStoreReplayArmed); }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cpu/base.cc` around lines 743 - 749, The CPU handover currently only transfers diffAllStates via takeOverDiffAllStates, but it must also preserve per-thread reconciliation state: copy recentCommittedStores and the syncVisibleStoreReplayArmed flag from oldCPU to this CPU when enable_diff is true; update the same handover site where getDiffAllStates() and takeOverDiffAllStates(std::move(diff_all)) are called to call the appropriate accessors/mutators (e.g., recentCommittedStores and syncVisibleStoreReplayArmed on the source CPU and the destination CPU) so the replacement core retains committed-store history and replay-arm bits after takeOverFrom/takeOverDiffAllStates.src/cpu/pred/btb/ras.cc (1)
20-27:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winAvoid undefined behavior in
maxCtrinitialization (BTBRAS)
src/cpu/pred/btb/ras.ccsetsmaxCtr((1 << ctrWidth) - 1)in both constructors;(1 << ctrWidth)is done asint(since1is anintliteral) andmaxCtris alsoint(src/cpu/pred/btb/ras.hh).ctrWidthis aParam.Unsignedthat is only range-checked to 32-bit unsigned (0..2^32-1), so a config withctrWidth >= 31can overflow/trigger undefined shift behavior. Use a wide unsigned literal/cast and guard the shift width (or widen/changemaxCtr’s type).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cpu/pred/btb/ras.cc` around lines 20 - 27, The initialization of maxCtr in the BTBRAS constructors uses (1 << ctrWidth) which shifts a 32-bit int and can UB when ctrWidth >= 31; update the BTBRAS constructors to compute maxCtr using a widened unsigned type and guard the shift: e.g., perform the shift with 1ULL (or static_cast<uint64_t>(1)) and cap/validate ctrWidth against the bit-width before shifting, or change maxCtr's type to an unsigned wide type (e.g., uint64_t) in ras.hh and compute maxCtr as ((ctrWidth >= (sizeof(maxCtr)*8)) ? std::numeric_limits<decltype(maxCtr)>::max() : ((decltype(maxCtr)1 << ctrWidth) - 1)); adjust both BTBRAS(...) initializers accordingly.
🟠 Major comments (23)
src/cpu/base.cc-911-915 (1)
911-915:⚠️ Potential issue | 🟠 Major | ⚡ Quick winUse the resolved CPU id when deriving difftest hart ids.
Line 914 uses
params().cpu_id, but this constructor can auto-fix invalid IDs by writing_cpuIdearlier in the file. If Python leavescpu_idunset, difftest still gets the stale sentinel here and assigns the wrong hart id.Suggested fix
int BaseCPU::difftestHartId(ThreadID tid) const { - return params().cpu_id * numThreads + tid; + return _cpuId * numThreads + tid; }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cpu/base.cc` around lines 911 - 915, The difftest hart id calculation in BaseCPU::difftestHartId uses params().cpu_id which may be the unresolved sentinel; change it to use the resolved member CPU id (the field set earlier, e.g., _cpuId or whatever member holds the fixed CPU id) so the function returns (_cpuId * numThreads + tid) instead of (params().cpu_id * numThreads + tid); update the reference in difftestHartId to the resolved ID member to ensure correct hart assignment.src/cpu/o3/commit.cc-1231-1237 (1)
1231-1237:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftThis loop bypasses
smtCommitPolicy.The new path iterates
*activeThreadsdirectly and never consultsgetCommittingThread(),roundRobin(), oroldestReady(). After this change, non-defaultsmtCommitPolicysettings stop affecting commit scheduling.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cpu/o3/commit.cc` around lines 1231 - 1237, The loop that iterates *activeThreads directly bypasses the SMT commit policy; replace the direct iteration with the policy-driven selection used elsewhere by calling the existing getCommittingThread / smtCommitPolicy selection path (e.g., use getCommittingThread(), roundRobin(), or oldestReady() as the codebase expects) so commit scheduling honors smtCommitPolicy. Specifically, stop looping over activeThreads in commit.cc and instead obtain each next commit_thread via the smtCommitPolicy/getCommittingThread mechanism and then check commitStatus[commit_thread] before committing, preserving the original policy-driven ordering and termination condition.src/cpu/o3/decode.cc-90-97 (1)
90-97:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAccount for the new stall buffer in reset and drain paths.
stallBufferandeachstallSizebecome persistent decode state here, butclearStates(),isDrained(), anddrainSanityCheck()still only reason aboutfixedbuffer. After a thread removal or CPU drain, stale fetch groups can survive in decode and be replayed later.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cpu/o3/decode.cc` around lines 90 - 97, clearStates(), isDrained(), and drainSanityCheck() currently only handle fixedbuffer and ignore the new decode state stallBuffer and eachstallSize; update those methods to account for them by clearing stallBuffer and eachstallSize in clearStates(), including them in the emptiness/drain checks in isDrained(), and verifying they are empty in drainSanityCheck() (similar to how fixedbuffer is handled) so stale fetch groups cannot survive thread removal or CPU drain.src/cpu/o3/commit.cc-1227-1229 (1)
1227-1229:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDon't mark the CPU active on zero-work commit cycles.
commit_width >= 0is always true, so commit now callsactivityThisCycle()even when no thread can retire anything. That can keep an empty core from going idle.Suggested fix
- if (commit_width >= 0) { + if (commit_width > 0) { cpu->activityThisCycle(); }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cpu/o3/commit.cc` around lines 1227 - 1229, The code calls cpu->activityThisCycle() whenever commit_width >= 0 which is always true; change the condition inside the commit routine so activityThisCycle() is only invoked when there was actual work retired (e.g., use commit_width > 0 or, better, check the actual number of instructions retired this cycle such as a local num_committed/count_retired variable) and call cpu->activityThisCycle() only when that count > 0; update the condition around cpu->activityThisCycle() in commit (the block referencing commit_width and cpu->activityThisCycle()) accordingly.src/cpu/o3/commit.cc-1462-1485 (1)
1462-1485:⚠️ Potential issue | 🟠 Major | ⚡ Quick winUse the 32-bit VSSTATUS view consistently in the
vsxl == 1path.Line 1475 mixes
vsstatus32.fswithvsstatus.vs. That recomputessdfrom two different register views and can write the wrong architecturalVSSTATUS.sdbit.Suggested fix
- vsstatus32.sd = (vsstatus32.fs == 3) || (vsstatus.vs == 3); + vsstatus32.sd = (vsstatus32.fs == 3) || (vsstatus32.vs == 3);🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cpu/o3/commit.cc` around lines 1462 - 1485, In the head_inst->isUpdateVsstatusSd() block, the vsxl == 1 path mixes the 32-bit view and 64-bit view (uses vsstatus32.fs but vsstatus.vs) causing incorrect SD computation; change the computation to use only the 32-bit view (vsstatus32.fs and vsstatus32.vs) and ensure you write back (RegVal)vsstatus32 via cpu->setMiscRegNoEffect(MISCREG_VSSTATUS, ... , tid) so the SD bit is derived and stored consistently from the VSSTATUS32 view.src/cpu/o3/commit.cc-1214-1241 (1)
1214-1241:⚠️ Potential issue | 🟠 Major | ⚡ Quick winKeep total retirement bounded by
commitWidth.This now computes a local quota of up to
commitWidthfor each active thread and then sums them intocommit_width. With 2 runnable threads, commit can retire up to2 * commitWidthin one cycle, which breaks the stage bandwidth contract and also skewscommitEligibleSamples.Suggested fix
- int commit_width = 0; + unsigned commit_width = 0; for (ThreadID tid : *activeThreads) { - commit_width_per_thread[tid] = - rob->countInstsOfGroups(tid, commitWidth); + const unsigned remaining = commitWidth - commit_width; + commit_width_per_thread[tid] = + rob->countInstsOfGroups(tid, remaining); commit_width += commit_width_per_thread[tid]; }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cpu/o3/commit.cc` around lines 1214 - 1241, The code currently sums per-thread quotas into commit_width allowing up to N * commitWidth retirements; instead enforce a global cap of commitWidth by adjusting commit_width_per_thread after computing per-thread counts: compute each thread's local quota via rob->countInstsOfGroups(tid, commitWidth), then if the sum of those quotas exceeds commitWidth, reduce quotas (e.g., iterate threads and decrement per-thread quotas or allocate quotas round-robin/proportionally) until the total <= commitWidth, set commit_width to that total, and keep the rest of the commit loop using num_committed and num_committed_per_thread as before so the outer while(num_committed < commit_width && ...) never allows more than commitWidth retirements; update any bookkeeping that relies on commit_width or commitEligibleSamples accordingly.src/cpu/o3/decode.cc-188-196 (1)
188-196:⚠️ Potential issue | 🟠 Major | ⚡ Quick winSize the SMT stats with the configured thread count.
These vectors are indexed by
tidintick()/decodeInsts(), but they are always initialized to 4 entries. Any config with more than 4 hardware threads will index past the registered stats size.Suggested fix
- smtidleCycles - .init(4) + smtidleCycles + .init(cpu->numThreads) .flags(statistics::total); smtblockedCycles - .init(4) + .init(cpu->numThreads) .flags(statistics::total); smtnotactiveCycles - .init(4) + .init(cpu->numThreads) .flags(statistics::total);🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cpu/o3/decode.cc` around lines 188 - 196, smtidleCycles, smtblockedCycles, and smtnotactiveCycles are sized to a hardcoded 4 but are indexed by tid in tick()/decodeInsts(), causing OOB for configs with >4 hardware threads; change their .init(4) calls to initialize with the configured SMT thread count (use the CPU/decoder's SMT thread count accessor—e.g., numThreads(), threads(), or threadContexts().size() used elsewhere in this module) so the stats vector length matches the actual number of hardware threads.src/cpu/o3/fetch.hh-249-251 (1)
249-251:⚠️ Potential issue | 🟠 Major | ⚡ Quick winInitialize the scheduler knobs from params instead of defaults here.
initDecodeScheduler()now branches on these members, but in the provided codesmtDecodePolicyis effectively hard-wired to"multi_priority"anddelayedSchedulerDelayhas no deterministic value. That means the new fetch scheduler is not actually configurable, and the"delayed"path will read garbage if it ever becomes reachable.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cpu/o3/fetch.hh` around lines 249 - 251, smtDecodePolicy and delayedSchedulerDelay are not initialized from the configuration and are using a hardwired default/garbage; update their initialization to read from the Params (e.g., the Fetch/CPU params) in the class constructor or member initializer list so initDecodeScheduler() sees the configured values — locate the member declarations smtDecodePolicy and delayedSchedulerDelay in fetch.hh and set them from the appropriate Params fields (the same names used in the param definition), or call the param accessor in the constructor before initDecodeScheduler() is invoked.src/cpu/o3/fetch.cc-1183-1194 (1)
1183-1194:⚠️ Potential issue | 🟠 Major | ⚡ Quick winFree late I-cache responses after squash.
Dropping queued retry packets here does not cover requests that were already accepted by the I-cache. Once
cacheReq.reset()runs, those late responses hit the unmatched-request path inprocessMultiCacheLineCompletion()and return without deletingpkt, so repeated squash/retry traffic leaksPacketobjects.Possible fix
@@ bool found_packet = threads[tid].cacheReq.markCompletedAndStorePacket(pkt); if (!found_packet) { + delete pkt; + ++fetchStats.icacheSquashes; DPRINTF(Fetch, "[tid:%i] Packet doesn't match current requests, deleting pkt %#lx\n", tid, pkt->getAddr()); DPRINTF(Fetch, "[tid:%i] Expected requests: ", tid); for (size_t i = 0; i < threads[tid].cacheReq.requests.size(); i++) { DPRINTF(Fetch, "req[%d]=0x%lx ", i, threads[tid].cacheReq.requests[i]->getVaddr());🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cpu/o3/fetch.cc` around lines 1183 - 1194, The squash path currently deletes queued retryPkt entries but leaves Packets already accepted by the I-cache dangling (they hit the unmatched-request branch in processMultiCacheLineCompletion() and are never deleted), so after cacheReq.reset() ensure those late I-cache responses are freed: either (a) on squash walk any container tracking outstanding accepted-cache Packets and delete/erase them (similar to the retryPkt loop), or (b) update processMultiCacheLineCompletion() to detect a squashed/unmatched response (no cacheReq or thread mismatch) and delete the incoming Packet before returning; reference symbols: retryPkt, cacheReq.reset(), processMultiCacheLineCompletion(), Packet, cacheBlocked, cpu->contextToThread(), tid.src/cpu/o3/iew.cc-1611-1614 (1)
1611-1614:⚠️ Potential issue | 🟠 Major | ⚡ Quick winClear
resolvedCFIsfor every thread before repopulating it.Fetch now consumes
fromIEW->iewInfo[tid].resolvedCFIsper thread, but this function still clears that vector only for*activeThreads->begin()a few lines below. Any other active thread retains last cycle's resolved CFIs and fetch will keep re-enqueueing stale resolve updates for it.Suggested change
while (threads != end) { ThreadID tid = *threads++; fetchRedirect[tid] = false; toFetch->iewInfo[tid].ldstqCount = ldstQueue.getCount(tid); toFetch->iewInfo[tid].robCount = rob->getThreadEntries(tid); toFetch->iewInfo[tid].iqCount = scheduler->getIQInsts(tid); + toFetch->iewInfo[tid].resolvedCFIs.clear(); } - - // Clear resolvedFSQId and resolvedInstPC since they are already handled in frontend - ThreadID tid = *activeThreads->begin(); - toFetch->iewInfo[tid].resolvedCFIs.clear();🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cpu/o3/iew.cc` around lines 1611 - 1614, The code only clears resolvedCFIs for *activeThreads->begin(), leaving other threads' fromIEW->iewInfo[tid].resolvedCFIs stale; fix by iterating all tids in activeThreads and calling resolvedCFIs.clear() (or equivalent) on each fromIEW->iewInfo[tid] (or toFetch->iewInfo[tid] as appropriate) before repopulating counts; place this loop alongside where you set ldstqCount/robCount/iqCount so every active thread's resolvedCFIs is reset each cycle.src/cpu/o3/lsq.cc-1954-1958 (1)
1954-1958:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDon't collapse independent-mode stall state into a global stall.
With
SMTLSQMode::Independent,lsqPolicycan still beDynamic, so this makes one thread's replay stall the other. The global path should only apply in shared mode.Suggested fix
- if (lsqPolicy == SMTQueuePolicy::Dynamic || - lsqPolicy == SMTQueuePolicy::DynamicBorrowing) + if (sharedLSQMode() && + (lsqPolicy == SMTQueuePolicy::Dynamic || + lsqPolicy == SMTQueuePolicy::DynamicBorrowing)) return isStalled(); else return thread[tid].isStalled();🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cpu/o3/lsq.cc` around lines 1954 - 1958, The current branch returns the global isStalled() whenever lsqPolicy is Dynamic/DynamicBorrowing, which incorrectly collapses per-thread stalls when SMTLSQMode is Independent; update the condition to only return global isStalled() when lsqPolicy is Dynamic or DynamicBorrowing AND the LSQ mode is not SMTLSQMode::Independent (otherwise return thread[tid].isStalled()); refer to lsqPolicy, SMTQueuePolicy::Dynamic, SMTQueuePolicy::DynamicBorrowing, SMTLSQMode::Independent, isStalled(), and thread[tid].isStalled() when making the change.src/cpu/o3/lsq_unit.cc-845-857 (1)
845-857:⚠️ Potential issue | 🟠 Major | ⚡ Quick winValidate the queue slot before returning the “current” request.
These helpers only check
lqIdx/sqIdx >= 0, but replay/squash/commit can leave a stale index that now points at a recycled LSQ slot. That can return another instruction's request and reintroduce the exact stale-request bug this change is trying to remove. Guard withisValidIdx(),valid(), and aseqNummatch before handing the pointer out.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cpu/o3/lsq_unit.cc` around lines 845 - 857, currentLoadRequest and currentStoreRequest currently only check lqIdx/sqIdx >= 0 and can return a recycled slot; update both to validate the indexed slot before returning: for currentLoadRequest check inst && inst->lqIdx >= 0 && isValidIdx(inst->lqIdx) && loadQueue[inst->lqIdx].valid() && loadQueue[inst->lqIdx].seqNum == inst->seqNum (then return request()), otherwise return nullptr; do the analogous checks in currentStoreRequest using sqIdx, storeQueue, and the slot's seqNum to ensure the slot still belongs to this instruction. Ensure you reference the loadQueue/storeQueue slot validation helpers (isValidIdx, valid) and compare seqNum on the slot against inst->seqNum.src/cpu/o3/lsq_unit.cc-1076-1077 (1)
1076-1077:⚠️ Potential issue | 🟠 Major | ⚡ Quick win
replay_executed_loadsis ignored, so visible atomic/sync stores never replay completed loads.
completeStore()computesreplay_executed_loadsspecifically for atomic and sync-visible stores, but this helper immediately marks it unused and always skipsld_inst->isExecuted(). That drops the path meant to invalidate already-issued conflicting loads after another hart makes a store visible.Also applies to: 1107-1123
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cpu/o3/lsq_unit.cc` around lines 1076 - 1077, The code currently marks replay_executed_loads and visible_store_seq as [[maybe_unused]] and always skips executed loads (using ld_inst->isExecuted()), which prevents atomic/sync-visible stores from replaying already-executed conflicting loads; update completeStore() to stop ignoring replay_executed_loads (remove the [[maybe_unused]]), and modify the load-replay condition where ld_inst->isExecuted() is checked so that executed loads are considered for replay when replay_executed_loads is true (e.g., only skip executed loads if replay_executed_loads is false); apply the same fix to the corresponding load-handling code paths noted around the 1107–1123 region so visible_store_seq and replay_executed_loads drive replay behavior correctly.src/cpu/o3/rename.cc-430-432 (1)
430-432:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDon't index
smtStallEventswith rawStallReason.
smtStallEventsis sized byStallEventCount, and the code above already mapsStallReasoninto that smaller enum. Indexing it directly withstallSig->renameBlockReason[tid]can hit the wrong bucket—or walk past the vector for reasons likeOtherFragStallor bandwidth-specific stalls.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cpu/o3/rename.cc` around lines 430 - 432, The code indexes stats.smtStallEvents with a raw StallReason (stallSig->renameBlockReason[tid]) which can be out-of-range because smtStallEvents is sized by StallEventCount; instead map the StallReason into the reduced StallEvent enum before indexing. In the rename.cc block around the if (stallSig->blockRename[tid]) branch, replace direct use of stallSig->renameBlockReason[tid] when calling setAllStalls and stats.smtStallEvents[...] with the mapped event value (the function or mapping that converts StallReason → StallEventCount-based enum), then use that mapped enum to call setAllStalls and stats.smtStallEvents[mappedEvent].sample(tid).src/cpu/o3/lsq_unit.cc-365-368 (1)
365-368:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDon't assert response size against
inst->effSizefor split loads.For split loads,
completeDataAccess()runs once per fragment, sopkt->getSize()can be smaller than the full instruction size. This will fire as soon as a split load in pmem goes through the multi-context difftest path; compare against the fragment size instead, or gate the equality check on!request->isSplit().🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cpu/o3/lsq_unit.cc` around lines 365 - 368, The assert comparing pkt->getSize() to inst->effSize is invalid for split loads; in completeDataAccess() replace the unconditional assert(size == inst->effSize) with a check that uses the fragment size for split requests (e.g. use request->getSize() when request->isSplit() is true) or gate the equality on !request->isSplit(); i.e. ensure you compare size to request->getSize() for fragments or to inst->effSize only for non-split requests (refer to pkt->getSize(), inst->effSize, request->isSplit(), and request->getSize()).src/cpu/o3/rob.hh-233-239 (1)
233-239:⚠️ Potential issue | 🟠 Major | ⚡ Quick win
isFull(tid)is checking total ROB size instead of the thread limit.The non-
DynamicBorrowingpath comparesthreadGroups[tid].size()againstnumEntries, but partitioned and threshold modes cap each thread atmaxEntries[tid]. That will report "not full" long after the thread has exhausted its own quota, so upstream stages can over-issue until a later assertion fires.Suggested fix
- return threadGroups[tid].size() == numEntries; + return threadGroups[tid].size() >= maxEntries[tid];🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cpu/o3/rob.hh` around lines 233 - 239, isFull(ThreadID tid) currently returns threadGroups[tid].size() == numEntries for non-DynamicBorrowing modes which checks against the total ROB size; change it to check the per-thread limit by comparing threadGroups[tid].size() (or equivalently numFreeEntries(tid)) against maxEntries[tid] so partitioned/threshold modes enforce each thread's quota; keep the existing DynamicBorrowing branch (robPolicy == SMTQueuePolicy::DynamicBorrowing) that uses numFreeEntries(tid) == 0 and replace the fallback comparison to use maxEntries[tid] instead of numEntries.src/cpu/pred/btb/btb_mgsc.cc-891-895 (1)
891-895:⚠️ Potential issue | 🟠 Major | ⚡ Quick winTrain MGSC weights with the same ASID-aware key used for lookup.
findWeight()now reads from a PC+ASID bucket, but the training path still ignores that key:weightTableIdxis computed here and thenupdateWeightTable()re-hashes by PC without ASID. OnceasidHash != 0, prediction and training hit different entries, so the active address space never converges its weights.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cpu/pred/btb/btb_mgsc.cc` around lines 891 - 895, The training path computes a PC-only index but should use the same PC+ASID key used for lookup; change the code that computes weightTableIdx in the training branch to call getPcIndex(stream.startPC, weightTableIdxWidth, stream.asidHash) (same parameters used by findWeight()) and pass that index into updateWeightTable (or change updateWeightTable to accept an externally computed index) so updateWeightTable does not re-hash by PC alone; ensure all calls to updateWeightTable from this training path use the ASID-aware index so prediction and training access the same bucket when stream.asidHash != 0.src/cpu/pred/btb/abtb.cc-398-438 (1)
398-438:⚠️ Potential issue | 🟠 Major | ⚡ Quick winKeep the ahead-pipeline queue keyed to the live BTB set.
This path still snapshots a full
BTBSetintoaheadReadBtbEntries. When the queued lookup is consumed, hit handling updates the copied entries while the real set inbtb[idx_prvStartpc]keeps its old recency state, so normal hits stop refreshing the live replacement metadata. Please queue only the prior block/set identity and re-read the live set when the ahead stage matures.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cpu/pred/btb/abtb.cc` around lines 398 - 438, The ahead-pipeline currently snapshots a full BTBSet into state.aheadReadBtbEntries in AheadBTB::lookupSingleBlock which causes stale recency metadata; instead push only the identifying info (e.g., prior block PC and its btb index/ASID hash) into state.aheadReadBtbEntries and when the entry is popped re-read the live BTB set from btb[idx_prvStartpc] (or via getIndex/getTag as appropriate) before doing tag compares and hit handling; update uses of std::tie(pc, idx_prvStartpc, set) so the popped tuple contains no BTBSet, then fetch BTBSet set = btb[idx_prvStartpc] (or equivalent) before iterating/printing or updating entries (affecting functions/variables: AheadBTB::lookupSingleBlock, state.aheadReadBtbEntries, btb, idx_prvStartpc, getIndex/getTag, aheadPipelinedStages, printTickedBTBEntry).src/cpu/pred/btb/btb_tage_ub.cc-325-327 (1)
325-327:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftMake the upper-bound predictor state truly thread-local.
Discarding
tidhere leaves the new thread-aware API as a no-op, while this class still keeps a singleexactPathHistory. InHistorySource::PathHashmode, different SMT threads will overwrite each other’s speculative path history, so update/recovery can run against the wrong context.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cpu/pred/btb/btb_tage_ub.cc` around lines 325 - 327, BTBTAGEUpperBound::getPredictionMeta currently ignores the ThreadID and shares a single exactPathHistory across threads, causing SMT threads to overwrite each other's speculative path history (HistorySource::PathHash mode); make the upper-bound predictor state truly thread-local by storing per-thread exactPathHistory (e.g., a vector/array or thread-local map keyed by ThreadID) and change BTBTAGEUpperBound::getPredictionMeta to index into that per-thread slot using tid so updates and recoveries use the correct context for each thread.src/cpu/pred/btb/decoupled_bpred.cc-567-575 (1)
567-575:⚠️ Potential issue | 🟠 Major | ⚡ Quick winMissing-target squash only rewinds the PC, not the thread history.
This path clears the FTQ and redirects
s0PC, but it leaves the thread’s speculative histories andhistoryManagers[tid]untouched. The next prediction on that thread will reuse stale history, and the folded-history predictors will diverge from the redirected control flow.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cpu/pred/btb/decoupled_bpred.cc` around lines 567 - 575, The squash-on-missing-target path rewinds only FTQ and s0PC but leaves per-thread speculative histories intact, so update the handler (after ftq.clear(tid); clearPreds(tid); threads[tid].validprediction = false; threads[tid].s0PC = redirect_pc;) to also rewind/clear the thread’s speculative history and folded-history manager state for tid: invoke the appropriate reset/clear/squash API on historyManagers[tid] and any per-thread speculative-history structures so the folded-history predictors are synchronized with the new redirect_pc and cannot reuse stale history.src/cpu/pred/btb/mbtb.cc-302-305 (1)
302-305:⚠️ Potential issue | 🟠 Major | ⚡ Quick winKeep victim-cache identity ASID-aware on the update path too.
These changes make MBTB lookup/tagging ASID-aware, but the victim-cache write/dedup path still matches entries by
pconly. That lets same-PC branches from different ASIDs overwrite or erase each other in the victim cache, so a later update can corrupt another address space's predictor state.Suggested direction
- if (vc_entry.valid && vc_entry.pc == entry.pc) { // pc is tag compared + const Addr entry_tag = getTag(entry.pc, stream.asidHash); + if (vc_entry.valid && vc_entry.pc == entry.pc && + vc_entry.tag == entry_tag) {Apply the same ASID-aware keying in
insertVictimCache()anderaseFromVictimCacheByPC()so lookup, update, insert, and dedup all agree on what constitutes the same victim-cache entry.Also applies to: 464-465, 512-568, 727-740
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cpu/pred/btb/mbtb.cc` around lines 302 - 305, The victim-cache update/dedup path is currently matching entries by pc only, which allows entries from different ASIDs to collide; update insertVictimCache() and eraseFromVictimCacheByPC() to use the same ASID-aware keying as lookup() (use the asidHash computed from stagePreds.front().asidHash or the same metadata path) so all victim-cache operations compare both pc and asidHash when matching entries; ensure any dedup/erase logic uses the combined (pc, asidHash) identity so victim entries for different ASIDs cannot overwrite or be erased by each other.src/cpu/valuepred/enhanced_stride.cc-125-128 (1)
125-128:⚠️ Potential issue | 🟠 Major | ⚡ Quick winFix
inflightWindowTagLengthvalidation and 64-bit-safe mask shifts to avoid UB
gem5_assert(params.inflightWindowTagLength, ...)only checks non-zero; negative values pass—assertparams.inflightWindowTagLength > 0.enhanced_stride.ccbuilds masks with expressions like((1 << targetLength) - 1)/((1 << tagWidth) - 1). WithValuePredictor.pydefaultinginflightWindowTagLength = Param.Int(64, ...), shifting by 64 with a 32-bit1can trigger undefined behavior—cap the parameter (and/or use auint64_t-based mask with alen >= 64fallback).Suggested fix
@@ - gem5_assert(params.inflightWindowTagLength, "EStride inflightWindowTagLength must > 0 \n"); + gem5_assert(params.inflightWindowTagLength > 0, + "EStride inflightWindowTagLength must > 0 \n"); + gem5_assert(params.inflightWindowTagLength <= 64, + "EStride inflightWindowTagLength must <= 64 \n");Update the mask generation to use
uint64_t(1)(and handlelen >= 64as~uint64_t(0)), so((1ULL << len) - 1)is never evaluated withlen == 64.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cpu/valuepred/enhanced_stride.cc` around lines 125 - 128, The assert for the inflight window tag length is too weak and the mask math uses 32-bit shifts causing UB for 64; change the check that validates params.inflightWindowTagLength to require > 0 (not just non-zero) and update mask construction in enhanced_stride.cc to use uint64_t (e.g., uint64_t(1) or 1ULL) and handle len >= 64 by returning ~uint64_t(0) (all-ones) instead of performing (1ULL << 64) - 1; ensure every use of inflightWindowTagLength or tagWidth that builds a mask follows this uint64_t-safe pattern so shifts never invoke UB.src/mem/physical.cc-855-857 (1)
855-857:⚠️ Potential issue | 🟠 Major | ⚡ Quick winRestore loop may leave stale bytes for zero words
At Line 855, the new
read_data != 0gate skips writing zero words, so previously non-zero destination bytes can survive restore. This can corrupt reconstructed memory when backing storage isn’t guaranteed zeroed.💡 Suggested fix
for (uint64_t x = 0; x < output.pos; x += sizeof(long)) { uint64_t read_data = *(decompress_file_buffer + x / sizeof(long)); - if (read_data != 0) { - pmem_current = (uint64_t*)(pmem + total_write_size + x); + pmem_current = (uint64_t*)(pmem + total_write_size + x); + if (read_data != 0 || *pmem_current != 0) { *pmem_current = read_data; - non_zero_dword++; + if (read_data != 0) { + non_zero_dword++; + } } }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/mem/physical.cc` around lines 855 - 857, The restore loop currently skips writing when read_data == 0 which can leave previously non-zero bytes in pmem; change the logic in the restore loop around variables read_data, pmem_current, pmem, total_write_size and x so that you always write the word to the destination (i.e., remove the read_data != 0 gate) or, if you keep the branch, explicitly write zero for the zero case (e.g., *pmem_current = 0), ensuring no stale bytes remain after restore.
🟡 Minor comments (4)
configs/common/Options.py-352-353 (1)
352-353:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winFix typo in
--smthelp text.The help string has a typo on Line 353 (
multitThread-supported), which shows up in CLI docs/help output.Suggested fix
- parser.add_argument("--smt", action="store_true", default=False, - help=""" RISCV SMT support, which requires multitThread-supported gcpt restore and diff-ref-so""") + parser.add_argument("--smt", action="store_true", default=False, + help="""RISC-V SMT support, which requires multi-thread-supported gcpt restore and diff-ref-so""")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@configs/common/Options.py` around lines 352 - 353, Update the help text for the parser.add_argument("--smt", action="store_true", default=False, help=...) entry to fix the typo "multitThread-supported" — change it to "multithread-supported" (or "multi-thread-supported" if you prefer hyphenation) so the help string reads e.g. "RISCV SMT support, which requires multithread-supported gcpt restore and diff-ref-so".src/cpu/o3/commit.cc-115-125 (1)
115-125:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winUse the current thread when dumping a stuck ROB head.
Line 115 still reads
rob->readHeadInst(0)inside the per-thread loop, so a stall on threadtid != 0reports thread 0's head or incorrectly says the ROB is empty.Suggested fix
- if (auto inst = rob->readHeadInst(0)) { + if (auto inst = rob->readHeadInst(tid)) {🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cpu/o3/commit.cc` around lines 115 - 125, The code is calling rob->readHeadInst(0) inside a per-thread loop which uses tid, causing dumps to show thread 0's head; change the call to use the current thread variable (tid) instead of 0 so readHeadInst(tid) retrieves the correct thread's head, and likewise ensure getInstList(tid) and the debug_insts population/dump logic remain consistent with tid when building and printing the stalled-ROB instruction list (symbols: readHeadInst, rob, tid, getInstList, debug_insts).src/cpu/pred/btb/btb_tage.cc-472-473 (1)
472-473:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winRemove the unreachable empty-vector fallback.
Line 472 already asserts
stagePredsis non-empty viapredictorTid(stagePreds), so thestagePreds.empty()fallback on Line 473 can never run. If an empty vector is legal here, this still aborts; otherwise simplify the code to a single non-empty contract.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cpu/pred/btb/btb_tage.cc` around lines 472 - 473, The fallback that checks stagePreds.empty() is unreachable because predictorTid(stagePreds) already requires a non-empty vector; replace the ternary with a direct access to stagePreds.front().asidHash (i.e., set asidHash := stagePreds.front().asidHash) and remove the empty-vector branch. If empty stagePreds should be considered legal, instead add an explicit guard/early return or assert before calling predictorTid and compute asidHash only when non-empty; otherwise keep the simplified non-empty contract and remove the redundant check.src/cpu/valuepred/valuepred_unit.hh-30-30 (1)
30-30:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winRename
assertValidTidtoassert_valid_tidto match lower_snake_case.
assertValidTidappears insrc/cpu/valuepred/valuepred_unit.hhand is defined insrc/cpu/valuepred/valuepred_unit.cc, with call sites insrc/cpu/valuepred/ideal_constant_lvp.ccandsrc/cpu/valuepred/enhanced_stride.cc; update the declaration, definition, and all call sites toassert_valid_tid.Suggested rename
- void assertValidTid(ThreadID tid) const; + void assert_valid_tid(ThreadID tid) const;🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cpu/valuepred/valuepred_unit.hh` at line 30, The declaration and uses of assertValidTid need to be renamed to lower_snake_case: change the function declaration and definition from assertValidTid(...) to assert_valid_tid(...), update its definition in valuepred_unit.cc, and update all call sites (e.g., calls in ideal_constant_lvp.cc and enhanced_stride.cc) to call assert_valid_tid; keep the signature and semantics unchanged and run a build to catch any remaining references.
🧹 Nitpick comments (2)
configs/example/smt_idealkmhv3.py (1)
12-13: ⚡ Quick winRename helper to snake_case for guideline compliance.
setSharedLSQParamsshould use lower_snake_case.Suggested fix
-def setSharedLSQParams(args, system): +def set_shared_lsq_params(args, system): setKmhV3IdealParams(args, system) @@ - setSharedLSQParams(args, test_sys) + set_shared_lsq_params(args, test_sys)As per coding guidelines: "Functions and methods should use lower_snake_case naming convention".
Also applies to: 45-45
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@configs/example/smt_idealkmhv3.py` around lines 12 - 13, Rename the helper function setSharedLSQParams to follow snake_case: set_shared_lsq_params; update its definition and all references (e.g., callers that invoke setSharedLSQParams and the helper call to setKmhV3IdealParams inside it) including the other occurrence noted (line 45) so they call set_shared_lsq_params, and ensure any imports or exports use the new name to keep behavior unchanged.src/cpu/pred/btb/test/abtb.test.cc (1)
44-63: ⚡ Quick winUse lower_snake_case for newly introduced helper functions.
makePredictionandclearAheadPipelineshould follow the repository naming rule for functions/methods.Suggested rename
-FullBTBPrediction makePrediction(Addr startPC, AheadBTB *abtb, - ThreadID tid = 0, uint8_t asidHash = 0) { +FullBTBPrediction make_prediction(Addr start_pc, AheadBTB *abtb, + ThreadID tid = 0, uint8_t asid_hash = 0) { @@ -void clearAheadPipeline(AheadBTB *abtb, ThreadID tid) { +void clear_ahead_pipeline(AheadBTB *abtb, ThreadID tid) {As per coding guidelines, "Functions and methods should use lower_snake_case naming convention".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cpu/pred/btb/test/abtb.test.cc` around lines 44 - 63, The helper functions violate naming conventions—rename makePrediction and clearAheadPipeline to lower_snake_case (e.g., make_prediction and clear_ahead_pipeline) keeping their signatures the same (makePrediction(Addr startPC, AheadBTB *abtb, ThreadID tid = 0, uint8_t asidHash = 0) and clearAheadPipeline(AheadBTB *abtb, ThreadID tid)), then update all references/call sites in this test file (and any headers) to the new names so the code compiles and follows the repository's function naming rule.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: a29d618b-a581-4652-ae35-442a8393b911
📒 Files selected for processing (92)
.gitignoreconfigs/common/FSConfig.pyconfigs/common/Options.pyconfigs/common/xiangshan.pyconfigs/example/smt_idealkmhv3.pysrc/arch/riscv/isa/vector/base/vector_mem.temp.isasrc/arch/riscv/isa/vector/simple/vector_mem.temp.isasrc/arch/riscv/tlb.ccsrc/arch/riscv/tlb.hhsrc/cpu/base.ccsrc/cpu/base.hhsrc/cpu/difftest.ccsrc/cpu/difftest.hhsrc/cpu/o3/BaseO3CPU.pysrc/cpu/o3/FuncScheduler.pysrc/cpu/o3/SConscriptsrc/cpu/o3/comm.hhsrc/cpu/o3/commit.ccsrc/cpu/o3/commit.hhsrc/cpu/o3/cpu.ccsrc/cpu/o3/cpu.hhsrc/cpu/o3/decode.ccsrc/cpu/o3/decode.hhsrc/cpu/o3/dyn_inst.hhsrc/cpu/o3/fetch.ccsrc/cpu/o3/fetch.hhsrc/cpu/o3/iew.ccsrc/cpu/o3/iew.hhsrc/cpu/o3/inst_queue.ccsrc/cpu/o3/inst_queue.hhsrc/cpu/o3/issue_queue.ccsrc/cpu/o3/issue_queue.hhsrc/cpu/o3/lsq.ccsrc/cpu/o3/lsq.hhsrc/cpu/o3/lsq_unit.ccsrc/cpu/o3/lsq_unit.hhsrc/cpu/o3/rename.ccsrc/cpu/o3/rename.hhsrc/cpu/o3/rob.ccsrc/cpu/o3/rob.hhsrc/cpu/o3/smt_sched.hhsrc/cpu/pred/BranchPredictor.pysrc/cpu/pred/SConscriptsrc/cpu/pred/btb/abtb.ccsrc/cpu/pred/btb/abtb.hhsrc/cpu/pred/btb/btb_ittage.ccsrc/cpu/pred/btb/btb_ittage.hhsrc/cpu/pred/btb/btb_mgsc.ccsrc/cpu/pred/btb/btb_mgsc.hhsrc/cpu/pred/btb/btb_tage.ccsrc/cpu/pred/btb/btb_tage.hhsrc/cpu/pred/btb/btb_tage_ub.ccsrc/cpu/pred/btb/btb_tage_ub.hhsrc/cpu/pred/btb/btb_ubtb.ccsrc/cpu/pred/btb/btb_ubtb.hhsrc/cpu/pred/btb/common.hhsrc/cpu/pred/btb/decoupled_bpred.ccsrc/cpu/pred/btb/decoupled_bpred.hhsrc/cpu/pred/btb/ftq.ccsrc/cpu/pred/btb/ftq.hhsrc/cpu/pred/btb/mbtb.ccsrc/cpu/pred/btb/mbtb.hhsrc/cpu/pred/btb/microtage.ccsrc/cpu/pred/btb/microtage.hhsrc/cpu/pred/btb/ras.ccsrc/cpu/pred/btb/ras.hhsrc/cpu/pred/btb/test/abtb.test.ccsrc/cpu/pred/btb/test/btb_tage.test.ccsrc/cpu/pred/btb/timed_base_pred.hhsrc/cpu/pred/btb/uras.ccsrc/cpu/pred/btb/uras.hhsrc/cpu/simple/base.ccsrc/cpu/simple/base.hhsrc/cpu/valuepred/ValuePredictor.pysrc/cpu/valuepred/enhanced_stride.ccsrc/cpu/valuepred/enhanced_stride.hhsrc/cpu/valuepred/ideal_constant_lvp.ccsrc/cpu/valuepred/ideal_constant_lvp.hhsrc/cpu/valuepred/valuepred_metadata.hhsrc/cpu/valuepred/valuepred_unit.ccsrc/cpu/valuepred/valuepred_unit.hhsrc/dev/riscv/HartCtrl.pysrc/dev/riscv/SConscriptsrc/dev/riscv/hart_ctrl.ccsrc/dev/riscv/hart_ctrl.hhsrc/mem/cache/prefetch/base.ccsrc/mem/cache/prefetch/base.hhsrc/mem/cache/prefetch/queued.ccsrc/mem/cache/prefetch/worker.ccsrc/mem/physical.ccsrc/sim/system.ccsrc/sim/system.hh
| if (interrupt != NoFault) { | ||
| // If inside a transaction, postpone interrupts | ||
| if (executingHtmTransaction(commit_thread)) { | ||
| cpu->clearInterrupts(0); | ||
| toIEW->commitInfo[0].clearInterrupt = true; | ||
| interrupt = NoFault; | ||
| avoidQuiesceLiveLock = true; | ||
| } else { | ||
| handleInterrupt(); | ||
| } |
There was a problem hiding this comment.
Only the interrupt-handling thread should defer or consume interrupts.
Interrupts are still handled on thread 0, but this block clears interrupt when any commit_thread is inside HTM. If thread 1 is transactional while thread 0 has a pending interrupt, the interrupt is dropped before thread 0 ever processes it.
Suggested fix
- if (interrupt != NoFault) {
- // If inside a transaction, postpone interrupts
- if (executingHtmTransaction(commit_thread)) {
+ if (commit_thread == 0 && interrupt != NoFault) {
+ // If inside a transaction, postpone interrupts
+ if (executingHtmTransaction(0)) {
cpu->clearInterrupts(0);
toIEW->commitInfo[0].clearInterrupt = true;
interrupt = NoFault;
avoidQuiesceLiveLock = true;
} else {
handleInterrupt();
}
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if (interrupt != NoFault) { | |
| // If inside a transaction, postpone interrupts | |
| if (executingHtmTransaction(commit_thread)) { | |
| cpu->clearInterrupts(0); | |
| toIEW->commitInfo[0].clearInterrupt = true; | |
| interrupt = NoFault; | |
| avoidQuiesceLiveLock = true; | |
| } else { | |
| handleInterrupt(); | |
| } | |
| if (commit_thread == 0 && interrupt != NoFault) { | |
| // If inside a transaction, postpone interrupts | |
| if (executingHtmTransaction(0)) { | |
| cpu->clearInterrupts(0); | |
| toIEW->commitInfo[0].clearInterrupt = true; | |
| interrupt = NoFault; | |
| avoidQuiesceLiveLock = true; | |
| } else { | |
| handleInterrupt(); | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/cpu/o3/commit.cc` around lines 1248 - 1257, The code currently clears and
consumes the pending interrupt whenever any commit_thread is inside HTM
(executingHtmTransaction(commit_thread)), which can drop an interrupt destined
for thread 0; change the logic so only the interrupt-handling thread (thread 0)
may defer/consume the interrupt. Concretely, in the block using
executingHtmTransaction(commit_thread), add a check that commit_thread is the
interrupt-handling thread (i.e., commit_thread corresponds to thread 0) before
calling cpu->clearInterrupts(0), setting toIEW->commitInfo[0].clearInterrupt =
true, clearing interrupt = NoFault, and setting avoidQuiesceLiveLock; if
commit_thread is not thread 0, do not clear or consume the interrupt and instead
call handleInterrupt() so the correct thread processes it.
| // A fence/flush only waits for the requesting thread's sbuffer domain. | ||
| for (ThreadID tid = 0; tid < numThreads; ++tid) { | ||
| if (!storeBufferFlushing(tid) || | ||
| !storeBufferEmpty(tid, _storeBufferFlushBeforeSeq[tid])) { | ||
| continue; | ||
| } | ||
|
|
||
| clearStoreBufferFlushing(tid); | ||
| cpu->activityThisCycle(); |
There was a problem hiding this comment.
Keep the flush armed until the matching SQ range is empty.
This now clears _storeBufferFlushing[tid] based only on the store-buffer domain. If the thread still has older committed stores waiting in SQ, a fence can observe "flush complete" before those stores ever offload.
Suggested fix
- // A fence/flush only waits for the requesting thread's sbuffer domain.
+ // A fence/flush only completes once the requesting thread has no matching
+ // stores left in either SQ or the store buffer.
for (ThreadID tid = 0; tid < numThreads; ++tid) {
- if (!storeBufferFlushing(tid) ||
- !storeBufferEmpty(tid, _storeBufferFlushBeforeSeq[tid])) {
+ const auto flush_before = _storeBufferFlushBeforeSeq[tid];
+ const bool pending_sq =
+ flush_before == static_cast<InstSeqNum>(-1) ?
+ hasStoresToWB(tid) :
+ hasStoresToWBBefore(tid, flush_before);
+ if (!storeBufferFlushing(tid) || pending_sq ||
+ !storeBufferEmpty(tid, flush_before)) {
continue;
}
clearStoreBufferFlushing(tid);
cpu->activityThisCycle();🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/cpu/o3/lsq.cc` around lines 1033 - 1041, The flush flag
`_storeBufferFlushing[tid]` is being cleared as soon as the store-buffer domain
is empty, but we must keep it armed until the matching store-queue (SQ) range up
to `_storeBufferFlushBeforeSeq[tid]` is also drained; change the condition
around clearStoreBufferFlushing(tid) so it only clears when both
storeBufferEmpty(tid, _storeBufferFlushBeforeSeq[tid]) and the corresponding SQ
is empty (e.g. add a check like storeQueueEmpty(tid,
_storeBufferFlushBeforeSeq[tid]) or equivalent), leaving
`_storeBufferFlushBeforeSeq` and cpu->activityThisCycle() behavior unchanged and
thus preventing a fence from seeing "flush complete" before older committed
stores offload.
| if (!entry->sending) { | ||
| return; | ||
| } | ||
|
|
||
| for (size_t i = 0; i < mask.size(); ++i) { | ||
| if (mask[i] && | ||
| entry->byteGenerations[offset + i] != 0 && | ||
| entry->byteGenerations[offset + i] <= generation) { | ||
| entry->validMask[offset + i] = false; | ||
| } | ||
| } | ||
| }; | ||
|
|
||
| for (ThreadID other_tid = 0; other_tid < numThreads; ++other_tid) { | ||
| if (other_tid == tid) { | ||
| continue; | ||
| } | ||
|
|
||
| auto entry = storeBuffer.get(other_tid, block_paddr); | ||
| if (!entry) { | ||
| continue; | ||
| } | ||
|
|
||
| invalidate_entry(entry); | ||
| invalidate_entry(entry->vice); | ||
| } |
There was a problem hiding this comment.
Invalidate stale vice bytes even before the vice starts sending.
The early !entry->sending return leaves older vice bytes intact. After the current entry drains and the vice is promoted, another thread can forward bytes that were already overwritten by a later visible store.
Suggested fix
auto invalidate_entry = [&](StoreBufferEntry *entry) {
if (!entry || offset + mask.size() > entry->validMask.size()) {
return;
}
-
- if (!entry->sending) {
- return;
- }
for (size_t i = 0; i < mask.size(); ++i) {
if (mask[i] &&
entry->byteGenerations[offset + i] != 0 &&
entry->byteGenerations[offset + i] <= generation) {
entry->validMask[offset + i] = false;
+ entry->byteGenerations[offset + i] = 0;
}
}
};🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/cpu/o3/lsq.cc` around lines 2160 - 2185, The lambda invalidate_entry
currently returns early when entry->sending, which prevents clearing stale bytes
in the vice; change it so that it does not skip processing vice entries — either
remove the early "if (!entry->sending) return;" or alter the logic to only skip
invalidation for the primary entry but still iterate and clear bytes for
entry->vice (using entry->byteGenerations, entry->validMask, mask, offset, and
generation). Ensure you still guard null entry pointers (storeBuffer.get and
entry->vice) but always run the invalidation loop for vice entries so
overwritten bytes are cleared before the vice begins sending.
| ThreadID tid = inst->threadNumber; | ||
| assert(canAllocate(tid, 1)); | ||
|
|
||
| // allocate group | ||
| bool alloc = (this->*allocateNewGroup)(inst, tid); |
There was a problem hiding this comment.
Only charge ROB capacity when a new group is actually allocated.
canAllocate(tid, 1) runs before allocateNewGroup() decides whether this instruction consumes another ROB entry. Under compressed-group policies, an instruction that extends the current group should still be accepted even when the thread is at its group limit; this assert turns that legal case into a false full condition.
Suggested fix
- ThreadID tid = inst->threadNumber;
- assert(canAllocate(tid, 1));
-
- // allocate group
- bool alloc = (this->*allocateNewGroup)(inst, tid);
+ ThreadID tid = inst->threadNumber;
+
+ // allocate group
+ bool alloc = (this->*allocateNewGroup)(inst, tid);
+ assert(!alloc || canAllocate(tid, 1));📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| ThreadID tid = inst->threadNumber; | |
| assert(canAllocate(tid, 1)); | |
| // allocate group | |
| bool alloc = (this->*allocateNewGroup)(inst, tid); | |
| ThreadID tid = inst->threadNumber; | |
| // allocate group | |
| bool alloc = (this->*allocateNewGroup)(inst, tid); | |
| assert(!alloc || canAllocate(tid, 1)); |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/cpu/o3/rob.cc` around lines 457 - 461, The assert calling
canAllocate(tid, 1) is executed before allocateNewGroup() decides if a new ROB
entry is needed, which blocks legal compressed-group extensions; remove or defer
that check and only charge/assert when a new group is actually allocated: call
(this->*allocateNewGroup)(inst, tid) first (or inspect its return value `alloc`)
and then if alloc is true perform canAllocate(tid, 1) / assert; reference the
ThreadID tid from inst->threadNumber, the allocateNewGroup call and the boolean
`alloc` to gate the capacity assertion.
| HartCtrl::read(PacketPtr pkt) | ||
| { | ||
| assert(pkt->getAddr() >= pioAddr && pkt->getAddr() < pioAddr + pioSize); | ||
| assert(pkt->getSize() > 0 && pkt->getSize() <= sizeof(uint64_t)); |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify Packet typed accessor contract and current HartCtrl usage.
set -euo pipefail
echo "[1] Locate Packet accessor declarations/definitions"
fd -i 'packet_access.hh'
fd -i 'packet.hh'
echo "[2] Inspect setLE/getLE and any size assertions"
rg -n -C3 'setLE<|getLE<|setRaw<|getRaw<|getSize\(\)' src mem include
echo "[3] Inspect HartCtrl read/write size handling"
rg -n -C5 'HartCtrl::read|HartCtrl::write|setLE\(|getLE<' src/dev/riscv/hart_ctrl.ccRepository: OpenXiangShan/GEM5
Length of output: 50437
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "## hart_ctrl.cc: show lines 1-120 (includes referenced line 24 and ~36-37)"
nl -ba src/dev/riscv/hart_ctrl.cc | sed -n '1,120p'
echo "## hart_ctrl.cc: show lines around any setLE usage"
rg -n "setLE\(|setLE<|getLE\(|setRaw<|setRaw\(" src/dev/riscv/hart_ctrl.cc
echo
for n in 24 36 37; do
echo "## context around line $n"
nl -ba src/dev/riscv/hart_ctrl.cc | sed -n "$((n-10)),$((n+20))p"
echo
done
echo "## packet_access.hh: locate setLE/getLE/setRaw definitions"
rg -n "template <typename T>.*setLE|setLE<|setLE\(|setRaw<|getRaw<|getLE<|Packet::setLE" src/mem/packet_access.hh src/mem/packet_access* || true
echo
echo "## packet_access.hh: print section covering setLE (best-effort window)"
# print a reasonable window around template setLE if rg found a line; otherwise print around where setLE likely is
nl -ba src/mem/packet_access.hh | sed -n '1,220p'Repository: OpenXiangShan/GEM5
Length of output: 198
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "## hart_ctrl.cc: show lines 1-130 (with line numbers)"
awk '{printf "%5d\t%s\n", NR, $0}' src/dev/riscv/hart_ctrl.cc | sed -n '1,130p'
echo
echo "## hart_ctrl.cc: show setLE/getLE occurrences"
rg -n "setLE<|setLE\(|getLE<|getLE\(|setRaw<|setRaw\(|getRaw<|getRaw\(" src/dev/riscv/hart_ctrl.cc || true
echo
echo "## hart_ctrl.cc: context around any setLE calls"
# Grab some wider contexts around each match line (max 30 lines around)
rg -n "setLE<|setLE\(" src/dev/riscv/hart_ctrl.cc || true
echo
echo "## packet_access.hh: locate setLE/getLE/setRaw/getRaw templates"
rg -n "template <typename T>.*setLE|setLE<|Packet::setLE|template <typename T>.*getLE|Packet::getLE|getRaw|setRaw" src/mem/packet_access.hh src/mem/packet.hh || true
echo
echo "## packet_access.hh: print first 260 lines (includes setLE/setRaw likely)"
awk '{printf "%5d\t%s\n", NR, $0}' src/mem/packet_access.hh | sed -n '1,260p'Repository: OpenXiangShan/GEM5
Length of output: 9485
Handle sub-word MMIO reads explicitly.
HartCtrl::read() allows pkt->getSize() of 1/2/4/8 bytes, but it always calls pkt->setLE(hartResetState[tid]) (template-deduced as uint64_t), and Packet::setRaw<T> asserts sizeof(T) <= pkt->getSize()—so 1/2/4-byte reads will trip the assertion (or corrupt memory if asserts are compiled out). Dispatch on pkt->getSize() like write() and serialize the matching width.
💡 Proposed fix
- pkt->setLE(hartResetState[tid]);
+ const uint64_t value = hartResetState[tid];
+ switch (pkt->getSize()) {
+ case sizeof(uint8_t):
+ pkt->setLE<uint8_t>(static_cast<uint8_t>(value));
+ break;
+ case sizeof(uint16_t):
+ pkt->setLE<uint16_t>(static_cast<uint16_t>(value));
+ break;
+ case sizeof(uint32_t):
+ pkt->setLE<uint32_t>(static_cast<uint32_t>(value));
+ break;
+ case sizeof(uint64_t):
+ pkt->setLE<uint64_t>(value);
+ break;
+ default:
+ panic("Unsupported HartCtrl read size %u\n", pkt->getSize());
+ }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/dev/riscv/hart_ctrl.cc` at line 24, HartCtrl::read() currently always
calls pkt->setLE(hartResetState[tid]) as a uint64_t which will fail for
1/2/4-byte MMIO reads; update HartCtrl::read to switch on pkt->getSize()
(1,2,4,8) and call pkt->setLE with the matching-width type (uint8_t, uint16_t,
uint32_t, uint64_t) or use the appropriate Packet::setRaw<T>/serialize path for
each width, keeping the existing size assert and preserving endianness handling
so sub-word reads are correctly serialized without triggering Packet assertions
or corrupting memory.
Change-Id: If3999d5c26bf325cfb052b100456e759388c2330
🚀 Coremark Smoke Test Results
✅ Difftest smoke test passed! |
Summary by CodeRabbit
New Features
Improvements