fix: enforce total message ordering via TCE for all clients#65

Open
justynspooner wants to merge 1 commit intotashigit:mainfrom
justynspooner:fix/enforce-tce-message-ordering

Conversation

@justynspooner

Warning

Proof of concept: authored with the assistance of Claude 4.6 (Opus) and neither compiled nor tested, as I don't have SSH access to the private tashi-consensus-engine dependency. rustfmt passes, but cargo check/clippy could not be run.

Summary

  • When TCE is active, skip eager local dispatch for locally-originated messages
  • All messages (including our own) are now dispatched from handle_tce_event after consensus is reached
  • Guarantees every subscriber sees messages in the same total order, regardless of which broker the publisher is connected to

Problem

Previously, a client publishing and subscribing on the same broker would receive its own messages immediately via local dispatch — potentially in a different order than subscribers on other brokers. The TCE provides a total ordering, but the local dispatch bypassed it.
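The gating described above can be sketched as follows. This is an illustrative standalone sketch, not the actual FoxMQ code: the names `DispatchPath` and `choose_dispatch_path` are hypothetical, standing in for the real broker's dispatch logic.

```rust
// Hypothetical sketch of the PR's core decision: when TCE is active,
// locally-originated publishes are no longer delivered eagerly; they
// wait for the consensus round-trip like everyone else's messages.

#[derive(Debug, PartialEq)]
enum DispatchPath {
    /// Deliver immediately to local subscribers (standalone mode only).
    EagerLocal,
    /// Defer delivery until the message comes back from consensus
    /// (handle_tce_event in the real code).
    AwaitConsensus,
}

fn choose_dispatch_path(tce_active: bool) -> DispatchPath {
    if tce_active {
        // With TCE, even our own messages wait for total ordering.
        DispatchPath::AwaitConsensus
    } else {
        // Standalone mode keeps the low-latency local dispatch.
        DispatchPath::EagerLocal
    }
}

fn main() {
    assert_eq!(choose_dispatch_path(true), DispatchPath::AwaitConsensus);
    assert_eq!(choose_dispatch_path(false), DispatchPath::EagerLocal);
}
```

The point of the change is that the dispatch decision depends only on whether TCE is active, not on whether the message originated locally.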

Tradeoff

Locally-originated messages now have slightly higher latency as they must complete the consensus round-trip before delivery. This is the correct behaviour for a system that requires total ordering guarantees.

Test plan

  • Verify messages are delivered in the same order to all subscribers across brokers
  • Verify locally-originated messages are no longer delivered before consensus
  • Verify standalone (non-TCE) mode is unaffected — local dispatch still works
  • Run existing integration tests

Contributor

@Divyanshu11011 Divyanshu11011 left a comment

Thanks for this!
Verified the bug and tested the fix end-to-end against your foxmq-glorious-glitch reproduction with a locally built PR branch.

What works

Ordering bug is real and your fix resolves it. With the v0.3.1 release binary, your 4-node demo produces 4 divergent orderings (every node sees its own 5 messages first, then remotes in consensus order). Swapping in this PR's binary, all 4 subscribers observe identical order across 20 messages. The core change (funnelling every publish through handle_tce_event and dropping the event.creator != platform.creator_id() gate) is the right direction.

Blocker: ACL bypass regression

The publish ACL check at src/mqtt/router.rs:1011-1026 lives inside dispatch() and only runs when origin.get_client_index() returns Some, which is
only true for PublishOrigin::Local. This PR skips the Local dispatch when TCE is active, so locally-originated publishes now take only the Consensus
path, which bypasses the ACL check entirely.

Reproduction (4-node cluster, --allow-anonymous-login):

permissions.toml:

[permissions."*"]
topic = [
  { filter = "blocked/#", allowed = ["subscribe"], denied = ["publish"] },
]

Anonymous client publishes QoS 1 to blocked/secret:

┌─────────┬───────────────────────────┬──────────────────────────────┬─────────┐
│ Branch  │ sub on publisher's broker │     sub on remote broker     │ PUBACK? │
├─────────┼───────────────────────────┼──────────────────────────────┼─────────┤
│ main    │ 0 (blocked by Local ACL)  │ 1 (pre-existing remote-leak) │ sent    │
├─────────┼───────────────────────────┼──────────────────────────────┼─────────┤
│ this PR │ 1 (leaked)                │ 1 (leaked)                   │ sent    │
└─────────┴───────────────────────────┴──────────────────────────────┴─────────┘

Before this PR, local subscribers were protected by the Local-origin ACL check. This PR removes that protection. Publishers also receive
PubAckReason::Success for denied publishes because nothing along the connection.rs:721 → 743 → router.transaction() path looks at permissions.

Suggested fix: hoist the publish ACL check up to src/mqtt/broker/connection.rs:683, right after validate_and_convert returns a Transaction and before tce_platform.reserve_tx() at line 721. On denial, respond per QoS with PubAckReason::NotAuthorized / PubRecReason::NotAuthorized / a silent drop for QoS 0. That also closes the pre-existing remote-leak bug: denied publishes never reach TCE in the first place.
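The per-QoS denial responses can be sketched like this. It is a hedged sketch, not FoxMQ code: `Denial` and `deny_response` are hypothetical names, and only the QoS-to-response mapping (which follows the MQTT 5 spec) is the substance.

```rust
// Sketch of the suggested on-denial behaviour, keyed by QoS level.
// MQTT 5 defines reason code 0x87 (Not authorized) for PUBACK and
// PUBREC; QoS 0 has no acknowledgement, so the only option is to
// drop the packet silently.

#[derive(Debug, PartialEq)]
enum Denial {
    /// QoS 0: no ack packet exists, so drop without a response.
    SilentDrop,
    /// QoS 1: reply with PUBACK, reason NotAuthorized (0x87).
    PubAckNotAuthorized,
    /// QoS 2: reply with PUBREC, reason NotAuthorized (0x87).
    PubRecNotAuthorized,
}

fn deny_response(qos: u8) -> Denial {
    match qos {
        0 => Denial::SilentDrop,
        1 => Denial::PubAckNotAuthorized,
        _ => Denial::PubRecNotAuthorized,
    }
}

fn main() {
    assert_eq!(deny_response(0), Denial::SilentDrop);
    assert_eq!(deny_response(1), Denial::PubAckNotAuthorized);
    assert_eq!(deny_response(2), Denial::PubRecNotAuthorized);
}
```

Doing this before reserve_tx() means a denied publish never consumes a consensus slot, which is what closes both the local and the remote leak.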

Non-blocker: investigated, not a problem

Initially suspected will messages would double-dispatch on the originator's broker (TCE path plus the leftover SystemCommand::PublishWill local-dispatch path). Tested: the will is delivered exactly once on every broker. The local-dispatch path is dead code when TCE is active: handle_connection_lost sends SystemCommand::Disconnected before dispatch_will sends SystemCommand::PublishWill on the same system_tx channel, so for clean_session=true the willing client is evicted first and dispatch() returns early on state.clients.get(willing_client) == None. Worth a comment in handle_system_command::PublishWill noting this ordering assumption, but not a blocker.

Nits

- src/mqtt/router.rs:829: the platform: &Platform parameter on handle_tce_event is unused after this PR (clippy flags it). Drop it or rename it _platform.
- A Rust integration test spawning a 2-broker cluster and asserting identical receive order would lock this invariant in. Happy to help write one.

Nice catch and clean fix. Once the ACL path is moved up, this is good to go.
