Description
Pluto's DKG timeout works differently from Charon's. Charon splits the timeout per phase (conf.Timeout / 6); Pluto uses one overall timeout for the whole run, because our sync service runs across all phases and can't be cut per phase (we tried; it killed healthy runs).
Two gaps come from this. First, a stuck peer is only caught after the full timeout (~60s) instead of after one phase (~10s). Second, with many validators (untested), a healthy run takes longer and could hit the timeout and be killed by mistake. This is made worse by parsigex already using the full conf.timeout per exchange.
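To make the two budgets concrete, here is a minimal sketch of the arithmetic, not Pluto's or Charon's actual code. `per_phase_budget` mirrors the conf.Timeout / 6 split described above. `exchange_timeout` is a hypothetical helper showing one possible way to keep a per-exchange timeout from exceeding what is left of the overall budget. Both function names and the clamping approach are assumptions made for illustration.

```rust
use std::time::{Duration, Instant};

/// Charon-style split: each phase gets an equal share of the configured timeout.
/// With a 60s timeout and 6 phases this is 10s per phase.
fn per_phase_budget(total: Duration, phases: u32) -> Duration {
    total / phases
}

/// Hypothetical helper (not an existing Pluto API): clamp the timeout of a
/// single exchange to the time remaining before the overall deadline, so one
/// exchange can never outlive the whole-run budget.
fn exchange_timeout(deadline: Instant, now: Instant, per_exchange: Duration) -> Duration {
    // Saturates to zero once the deadline has already passed.
    let remaining = deadline.saturating_duration_since(now);
    per_exchange.min(remaining)
}
```

With a 60s overall timeout, an exchange that starts 45s into the run would get 15s under this scheme rather than a fresh 60s.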
Works today: a stuck peer aborts cleanly, and a small healthy cluster finishes fine. Not covered: many-validator runs, and there's no automated test for the timeout.
Benefit (The "Why")
A DKG should never hang forever on a stuck peer, but it also must never kill a healthy ceremony by mistake. Closing these gaps makes the timeout reliable for real clusters (many validators), gives faster failure detection, and protects the timeout behavior with an automated test.
Acceptance Criteria
- A healthy DKG with many validators completes without a false timeout under the default --timeout.
- The relationship between the overall timeout and the per-exchange parsigex timeout is consistent (one budget is not silently exceeded by the other).
- A stuck/absent peer still aborts cleanly with DkgError::Timeout.
- An automated test covers the overall timeout (stuck peer aborts) and the healthy path (no false timeout).
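The last criterion could be exercised along the lines of the sketch below. It is a simplified model using only the standard library, not the real ceremony. `run_ceremony` is a hypothetical stand-in that waits for one completion signal per peer under a single overall deadline, and the local `DkgError` enum only mirrors the `Timeout` variant named above. A peer given as `None` models a stuck peer that never reports.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

/// Stand-in for the real error type; only the Timeout variant is named in this issue.
#[derive(Debug, PartialEq)]
enum DkgError {
    Timeout,
}

/// Hypothetical model of a ceremony: each peer either reports completion
/// after a delay (`Some(delay)`) or is stuck and never reports (`None`).
/// All peers must report before one overall deadline.
fn run_ceremony(peer_delays: &[Option<Duration>], timeout: Duration) -> Result<(), DkgError> {
    let deadline = Instant::now() + timeout;
    let (tx, rx) = mpsc::channel::<()>();
    // Senders of stuck peers are kept alive so the channel stays open
    // and the wait ends by timeout rather than by disconnection.
    let mut stuck = Vec::new();

    for delay in peer_delays {
        let tx = tx.clone();
        match *delay {
            Some(d) => {
                thread::spawn(move || {
                    thread::sleep(d);
                    let _ = tx.send(());
                });
            }
            None => stuck.push(tx),
        }
    }
    drop(tx);

    for _ in 0..peer_delays.len() {
        // Every wait draws from the same overall budget.
        let remaining = deadline.saturating_duration_since(Instant::now());
        rx.recv_timeout(remaining).map_err(|_| DkgError::Timeout)?;
    }
    Ok(())
}
```

A real test would replace the model with actual nodes, but the two assertions keep the same shape: a run where every peer reports returns `Ok(())`, and a run with one stuck peer returns `Err(DkgError::Timeout)` once the overall budget is spent.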