Description
There is an II mismatch between the compiler's schedule and the RTL execution in the BiCG kernel. The compiler schedules under the assumption of zero-delay for cross-tile data movement, but the RTL implementation introduces a 1-cycle routing pipeline register delay on mesh links.
When this routing delay lands on a fully scheduled operation slot, with no NAH slots available to absorb it, one operand arrives a cycle late and the data is misaligned. The consuming operation then stalls for 1 unexpected cycle before it can execute, which directly increases the overall II.
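To make the effect on II concrete, here is a deliberately simplified toy model, not the compiler's or the RTL's actual logic. It assumes a tile's schedule is just a list of slot labels, that each `NAH` slot can absorb one cycle of routing delay, and that any delay left over extends the II one-for-one. The function name `effective_ii` and the slot labels are hypothetical.

```python
def effective_ii(slots, routing_delay_cycles):
    """Toy estimate of the achieved II for one tile.

    slots: list of slot labels for one scheduled iteration, e.g. ["+", "NAH", "st"].
    routing_delay_cycles: extra cycles introduced by mesh-link pipeline registers.

    Assumption (not from the RTL): each NAH slot absorbs one delay cycle,
    and any unabsorbed delay cycle adds one cycle to the II.
    """
    compiled_ii = len(slots)
    nah_slack = sum(1 for s in slots if s == "NAH")
    unabsorbed = max(0, routing_delay_cycles - nah_slack)
    return compiled_ii + unabsorbed


# Fully scheduled tile: the 1-cycle delay cannot be hidden.
print(effective_ii(["+", "*", "st"], 1))    # 4
# Tile with one NAH slot: the same delay is absorbed.
print(effective_ii(["+", "NAH", "st"], 1))  # 3
```

Under this model the mismatch only shows up on tiles with no idle slots, which matches the case described above.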
Log Trace
We can observe this behavior on an add (+) operation at tile 4 (t4), which ends up occupying two cycles instead of one:
cyc= 280 | t0:a9t29(NAH) | t1:a9t29(grant_pred)✓ | t2:a0t30(NAH) | t4:a9t29(+) ◇ | t5:a8t28(!) ✓ | t6:a0t30(ret_void) ◇ | t8:a8t28(*) ✓ | t9:a8t28(*) ☒ | t12:a0t10(st) ◇ | t13:a8t8(NAH)
cyc= 281 | t0:a0t30(grant_once')✓ | t1:a9t29(grant_pred)☒ | t2:a1t31(NAH) | t4:a9t29(+) ✓ | t5:a9t29(grant_pred)✓ | t6:a0t30(ret_void) ◇ | t8:a9t29(strdcst) ◇ | t9:a8t28(*) ☒ | t12:a0t10(st) ◇ | t13:a9t9(+) ✓
As shown for PE (0,1) below, the `ADD` takes 2 cycles; the first of them is actually the stall.
<img width="4201" height="1253" alt="Image" src="https://github.com/user-attachments/assets/fd147019-cb80-433a-9800-6f055eeac061" />
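For anyone who wants to look for the same pattern elsewhere in the log, the following is a minimal sketch of a trace scanner. It assumes the line format shown in the trace above, and it assumes (this is a guess from the two cycles shown, not a documented meaning) that `◇` marks an operation that is waiting and `✓` marks one that fired. The helper names `parse_line` and `find_stalls` are hypothetical.

```python
import re

# One cell looks like "t4:a9t29(+) ◇"; the status symbol is optional (NAH cells have none).
ENTRY = re.compile(r"(t\d+):(\w+)\(([^)]*)\)\s*([◇✓☒]?)")


def parse_line(line):
    """Return (cycle, {tile: (tag, op, status)}) for one 'cyc= N | ...' line."""
    head, *cells = line.split("|")
    cycle = int(head.split("=")[1])
    entries = {}
    for cell in cells:
        m = ENTRY.search(cell)
        if m:
            tile, tag, op, status = m.groups()
            entries[tile] = (tag, op, status)
    return cycle, entries


def find_stalls(lines):
    """Report ops that wait (assumed '◇') in one cycle and fire (assumed '✓') in the next."""
    stalls = []
    prev_cycle, prev = None, {}
    for line in lines:
        if not line.strip().startswith("cyc"):
            continue
        cycle, cur = parse_line(line)
        if prev_cycle is not None and cycle == prev_cycle + 1:
            for tile, (tag, op, status) in cur.items():
                if tile in prev and prev[tile] == (tag, op, "◇") and status == "✓":
                    stalls.append((tile, op, prev_cycle, cycle))
        prev_cycle, prev = cycle, cur
    return stalls


TRACE = [
    "cyc= 280 | t0:a9t29(NAH) | t1:a9t29(grant_pred)✓ | t2:a0t30(NAH) | t4:a9t29(+) ◇ | t5:a8t28(!) ✓ | t6:a0t30(ret_void) ◇ | t8:a8t28(*) ✓ | t9:a8t28(*) ☒ | t12:a0t10(st) ◇ | t13:a8t8(NAH)",
    "cyc= 281 | t0:a0t30(grant_once')✓ | t1:a9t29(grant_pred)☒ | t2:a1t31(NAH) | t4:a9t29(+) ✓ | t5:a9t29(grant_pred)✓ | t6:a0t30(ret_void) ◇ | t8:a9t29(strdcst) ◇ | t9:a8t28(*) ☒ | t12:a0t10(st) ◇ | t13:a9t9(+) ✓",
]

print(find_stalls(TRACE))  # [('t4', '+', 280, 281)]
```

On the two cycles quoted in this issue, the scanner flags only the `+` on `t4`. Entries that stay at `◇` across both cycles (such as `ret_void` on `t6`) are not reported, since under the stated assumption they never fire within the window.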