Summary
In convertTaskToDLQTask() in task_processor.go, failed replication
tasks are stored in the DLQ using p.shard.GetShardID() (the local
shard ID). When source and target clusters have different shard counts
(e.g., 1:2 ratio), GetDLQReplicationMessages cannot locate these
tasks because it looks up DLQ entries by a different shard ID mapping.
Root Cause
convertTaskToDLQTask() always uses the local shard ID:
    // TODO: GetShardID will break GetDLQReplicationMessages
    // we need to handle DLQ for cross shard replication.
    return &persistence.PutReplicationTaskToDLQRequest{
        ShardID: p.shard.GetShardID(), // always local shard ID
        // ... remaining fields omitted
    }
This assumption breaks when source shard count != target shard count.
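The mismatch can be illustrated with a small, self-contained sketch. It is not Temporal code: `correspondingShard` and the in-memory `dlq` map are hypothetical, and the sketch assumes 1-based shard IDs, modulo-based shard assignment, and a reader that derives its shard ID from that mapping.

```go
package main

import "fmt"

// correspondingShard is a hypothetical helper. Given a 1-based shard ID on the
// cluster with more shards, it returns the 1-based shard ID on the cluster with
// fewer shards, assuming modulo-based assignment and that smallerCount divides
// the larger cluster's shard count evenly.
func correspondingShard(largerShardID, smallerCount int32) int32 {
	return (largerShardID-1)%smallerCount + 1
}

func main() {
	const sourceShards int32 = 8192 // source cluster shard count
	localShardID := int32(8197)     // a shard on the 16384-shard target cluster

	// Toy DLQ keyed by shard ID, standing in for the persistence layer.
	dlq := map[int32][]string{}

	// Mirrors the reported behavior: the task is written under the local shard ID.
	dlq[localShardID] = append(dlq[localShardID], "failed-task")

	// A reader that applies the shard mapping looks under a different key.
	lookupID := correspondingShard(localShardID, sourceShards)
	fmt.Println("written under:", localShardID, "looked up under:", lookupID)
	fmt.Println("found:", len(dlq[lookupID]) > 0)
}
```

With these numbers the task is written under shard 8197 and looked up under shard 5, so the lookup finds nothing.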
Impact
- Failed replication tasks stored in the DLQ are permanently inaccessible:
GetDLQReplicationMessages (handler.go:1536) cannot retrieve them
- DLQ merging operations silently fail to find affected tasks
- No error is logged, so the result is silent data loss under cross-shard replication
- Affects all 4 task types: SYNC_ACTIVITY, HISTORY_V2,
SYNC_WORKFLOW_STATE, SYNC_HSM
Relationship to #10224
This is related to #10224 (replication tasks not cleaned up with
mismatched shard counts): same root cause, different symptom.
Steps to Reproduce
- Set up two clusters with different shard counts
  - Source: history.numberOfShards = 8192
  - Target: history.numberOfShards = 16384
- Run workflows that generate replication failures
- Attempt DLQ merge via GetDLQReplicationMessages
- Observe: DLQ tasks not found despite being written
Suggested Fix
Similar to the fix in PR #10226, use common.MapShardID to determine
the correct target shard IDs when writing to DLQ, ensuring consistency
with how GetDLQReplicationMessages retrieves them.
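One way to reason about the fix is to check that the shard mapping is consistent in both directions, so that a key chosen by the writer can always be recovered by the reader. The sketch below is an assumption-laden illustration: `mapShardID` is a hypothetical stand-in written for this example (1-based IDs, modulo-based assignment, shard counts that divide evenly), not the actual `common.MapShardID` implementation, and `roundTrips` is a hypothetical check.

```go
package main

import "fmt"

// mapShardID is a hypothetical stand-in for a shard-mapping helper. It maps a
// 1-based shard ID on a cluster with fromCount shards to the corresponding
// shard IDs on a cluster with toCount shards. It returns nil when neither
// count divides the other.
func mapShardID(fromCount, toCount, fromShardID int32) []int32 {
	if fromCount%toCount != 0 && toCount%fromCount != 0 {
		return nil
	}
	zeroBased := fromShardID - 1
	if fromCount >= toCount {
		// Many-to-one (or one-to-one): fold into the smaller shard space.
		return []int32{zeroBased%toCount + 1}
	}
	// One-to-many: each shard fans out to toCount/fromCount shards.
	ratio := toCount / fromCount
	ids := make([]int32, 0, ratio)
	for i := int32(0); i < ratio; i++ {
		ids = append(ids, zeroBased+i*fromCount+1)
	}
	return ids
}

// roundTrips reports whether every shard on the "from" side maps to at least
// one shard on the "to" side, and every such shard maps back to the original.
func roundTrips(fromCount, toCount int32) bool {
	for id := int32(1); id <= fromCount; id++ {
		mapped := mapShardID(fromCount, toCount, id)
		if len(mapped) == 0 {
			return false
		}
		for _, m := range mapped {
			found := false
			for _, back := range mapShardID(toCount, fromCount, m) {
				if back == id {
					found = true
				}
			}
			if !found {
				return false
			}
		}
	}
	return true
}

func main() {
	fmt.Println(mapShardID(8192, 16384, 5))    // [5 8197]
	fmt.Println(mapShardID(16384, 8192, 8197)) // [5]
	fmt.Println(roundTrips(8192, 16384), roundTrips(16384, 8192))
}
```

Under these assumptions, a DLQ write keyed by the mapped shard ID and a read keyed by the same mapping agree for every shard, which is the consistency property the suggested fix depends on.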
References
- service/history/replication/task_processor.go lines 372, 405, 437, 455
- service/history/handler.go lines 1535-1601
- service/history/replication/dlq_handler.go line 301

cc @temporalio/server, would appreciate the potential-bug label if appropriate