oracle: fix consensus deadlock from stale attestation timestamps #383
When oracle consensus stalls (e.g., from rapid block production or network partition), all oracles create attestations with the same frozen (price, timestamp) tuple. The Phase2 hash, computed from (oracle_id, price, timestamp), is identical each cycle, so the duplicate filter in AddOracleMessage() permanently rejects it. Since BroadcastMessage() calls AddOracleMessage() before pushing to P2P, the message never reaches the network. All oracle nodes enter this state simultaneously, creating a network-wide deadlock with no automatic recovery.

Three-part fix:

1. Stale consensus detection (node.cpp): When the consensus timestamp is >5 minutes old, BroadcastCurrentPrice() clears stale pending state and falls through to individual price broadcast with GetTime() as timestamp. The fresh timestamp produces a different Phase2 hash, breaking the duplicate filter collision and allowing the message to propagate.

2. Periodic seen-hash cleanup (bundle_manager.cpp): Clear seen_message_hashes every 300 seconds in AddOracleMessage(). This provides automatic recovery even without the stale consensus detection. The pending_messages map (keyed by oracle_id, accepting only newer timestamps) provides authoritative dedup; the seen hash set is a best-effort P2P optimization only.

3. State reset on stoporacle (digidollar.cpp): stoporacle now calls ClearPendingMessages() to reset seen_message_hashes, pending_messages, and pending_attestations. Gives operators a manual recovery path via stoporacle/startoracle cycle.

Fixes the oracle attestation stuck-at-4/5 bug observed on testnet19 where 8 oracles reported valid prices but consensus_price remained 0 for 19+ hours until an operator manually restarted their node.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
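The deadlock described in the commit message can be reproduced in a toy model. The sketch below is not the project's code: `OracleMessage`, `Phase2Key`, `ToyBundleManager`, `relayed_count`, and `RelayedWithFrozenTimestamp` are hypothetical stand-ins, and a plain string key replaces the real Phase2 hash. It only illustrates why filtering locally before the P2P push traps a message whose (oracle_id, price, timestamp) never changes.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_set>

// Hypothetical, simplified stand-in for an oracle price message.
struct OracleMessage {
    uint32_t oracle_id;
    uint64_t price_micro_usd;
    int64_t timestamp;
};

// Stand-in for the Phase2 hash H(oracle_id, price, timestamp). A string key
// is enough to show that identical inputs always collide.
std::string Phase2Key(const OracleMessage& m) {
    return std::to_string(m.oracle_id) + ":" +
           std::to_string(m.price_micro_usd) + ":" +
           std::to_string(m.timestamp);
}

class ToyBundleManager {
public:
    // Returns false when the key was already seen (duplicate filter).
    bool AddOracleMessage(const OracleMessage& m) {
        return seen_message_hashes.insert(Phase2Key(m)).second;
    }

    // Ordering described in the PR: local filter first, P2P push second.
    bool BroadcastMessage(const OracleMessage& m) {
        if (!AddOracleMessage(m)) return false;  // never reaches the network
        ++relayed_count;                         // stands in for the P2P push
        return true;
    }

    int relayed_count = 0;

private:
    std::unordered_set<std::string> seen_message_hashes;
};

// Counts how many of `cycles` broadcast attempts are relayed when every
// attempt reuses the same frozen (price, timestamp) tuple.
int RelayedWithFrozenTimestamp(int cycles) {
    assert(cycles >= 0);
    ToyBundleManager mgr;
    const OracleMessage frozen{1, 4450, 1700000000};  // illustrative values
    for (int i = 0; i < cycles; ++i) mgr.BroadcastMessage(frozen);
    return mgr.relayed_count;
}
```

In this model only the first attempt is relayed, however many cycles run. Changing any one field of the tuple, such as the timestamp, yields a new key and lets the message through, which is the property the fix relies on.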
Hey Johnny, thank you so much for digging into this and putting in the time to track down the root cause. The deadlock analysis is excellent: the frozen timestamp → identical Phase2 hash → permanent duplicate filter rejection chain is exactly what we were seeing on testnet19 and you nailed it. The three-part oracle fix (stale consensus detection, periodic seen-hash cleanup, and state reset on

I ran the full test suite locally: 1,969 C++ unit tests pass and all 311 functional tests pass. However, both CI checks (macOS and Ubuntu) are failing on the PR, and I noticed the branch was forked from before my latest RC22 RPC display fix commit (

No worries at all. I'm going to cherry-pick your oracle fix commit directly onto our current branch tip so everything stays clean. Your authorship will be preserved in the commit. This way we keep both the oracle deadlock fix and the RC22 RPC corrections in one clean history.

Thank you again for the incredible contribution. This kind of deep debugging work is exactly what we need as we push toward mainnet. The whole testnet19 testing effort from everyone has been invaluable. 🚀💎
6963ed5 into DigiByte-Core:feature/digidollar-v1
Summary
Fixes the oracle attestation deadlock where consensus gets permanently stuck at 0 despite all oracles reporting valid prices. Observed on testnet19: 8/9 oracles reporting ~$0.00445 but consensus_price_micro_usd: 0 and last_bundle_height: 0 for 19+ hours.

Root Cause

When a consensus round stalls, all oracles create attestations with the same frozen (price, timestamp) tuple. The Phase2 hash, H(oracle_id, price, timestamp), is identical each cycle, so the duplicate filter in AddOracleMessage() permanently rejects it. Since BroadcastMessage() calls AddOracleMessage() before pushing to P2P, the message never reaches the network. All oracle nodes enter this state simultaneously with no automatic recovery.

The deadlock chain:

1. pending_messages contains entries with stale timestamp T
2. ComputeConsensusValues() returns consensus_timestamp = T (median of stale entries)
3. (price, T) → Phase2 hash H
4. AddOracleMessage() finds H in seen_message_hashes → rejected as duplicate
5. BroadcastMessage() returns false → message never reaches P2P

Three-Part Fix

1. Stale consensus detection (src/oracle/node.cpp): When the consensus timestamp is >5 minutes old, BroadcastCurrentPrice() clears stale state and falls through to individual price broadcast with GetTime(). Fresh timestamp → different Phase2 hash → passes duplicate filter → propagates to network → new consensus forms.

2. Periodic seen-hash cleanup (src/oracle/bundle_manager.cpp): Clear seen_message_hashes every 300 seconds in AddOracleMessage(). Safety net for automatic recovery. The pending_messages map (keyed by oracle_id, accepting only newer timestamps) provides authoritative dedup.

3. State reset on stoporacle (src/rpc/digidollar.cpp): stoporacle now calls ClearPendingMessages() to reset the duplicate filter. Gives operators a manual recovery path.

Relationship to RC22 pending_messages.clear() Fix

The RC22 fix (f7a4b1d) removed pending_messages.clear() from AddOracleBundleToBlock() to prevent messages being drained on every CreateNewBlock() call. This fix addresses a different but related deadlock: messages survive template creation but get permanently trapped in the duplicate filter when the consensus timestamp freezes.

Test plan

stoporacle/startoracle cycle clears state and allows fresh consensus
oracle_bundle_manager_tests, oracle_phase2_tests)

🤖 Generated with Claude Code
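As a rough illustration of the three parts of the fix, the following sketch models them with a seen-hash set and an injected clock. It is an assumption-laden toy, not the code in src/oracle: `ToyBundleManager`, `ChooseAttestationTimestamp`, `Phase2Key`, `DemoPeriodicCleanup`, and both constants are hypothetical names. The `now` parameter stands in for GetTime(), and the pending_messages map and pending_attestations are deliberately left out.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_set>

constexpr int64_t STALE_CONSENSUS_SECS = 300;  // ">5 minutes old" in the PR
constexpr int64_t SEEN_CLEANUP_SECS = 300;     // "every 300 seconds" in the PR

// Hypothetical, simplified stand-in for an oracle price message.
struct OracleMessage {
    uint32_t oracle_id;
    uint64_t price_micro_usd;
    int64_t timestamp;
};

// Stand-in for the Phase2 hash H(oracle_id, price, timestamp).
std::string Phase2Key(const OracleMessage& m) {
    return std::to_string(m.oracle_id) + ":" +
           std::to_string(m.price_micro_usd) + ":" +
           std::to_string(m.timestamp);
}

// Part 1: when the consensus timestamp is stale, attest with the current
// time so the resulting key differs from the one stuck in the filter.
int64_t ChooseAttestationTimestamp(int64_t consensus_timestamp, int64_t now) {
    const bool stale = (now - consensus_timestamp) > STALE_CONSENSUS_SECS;
    return stale ? now : consensus_timestamp;
}

class ToyBundleManager {
public:
    // Part 2: clear the seen set once the cleanup interval has elapsed,
    // then apply the duplicate filter as before.
    bool AddOracleMessage(const OracleMessage& m, int64_t now) {
        if (now - last_cleanup >= SEEN_CLEANUP_SECS) {
            seen_message_hashes.clear();
            last_cleanup = now;
        }
        return seen_message_hashes.insert(Phase2Key(m)).second;
    }

    // Part 3: manual reset, as triggered by stoporacle in the PR. The toy
    // only has the seen set to clear.
    void ClearPendingMessages() { seen_message_hashes.clear(); }

private:
    std::unordered_set<std::string> seen_message_hashes;
    int64_t last_cleanup = 0;
};

// Walks through the recovery: a frozen message is rejected as a duplicate,
// then accepted again once the cleanup interval has elapsed.
void DemoPeriodicCleanup() {
    ToyBundleManager mgr;
    const int64_t T = 1700000000;  // illustrative frozen timestamp
    const OracleMessage frozen{1, 4450, T};
    assert(mgr.AddOracleMessage(frozen, T + 10));   // first sighting
    assert(!mgr.AddOracleMessage(frozen, T + 20));  // duplicate: rejected
    assert(mgr.AddOracleMessage(frozen, T + 310));  // set cleared: accepted
}
```

Passing the clock in as a parameter keeps the toy deterministic and makes the 300-second boundary easy to exercise in a unit test. Whether the real tests inject time this way is not stated in the PR.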