DLQ replication tasks inaccessible when source/target shard counts differ #10436

@NasitSony

Summary

In convertTaskToDLQTask() in task_processor.go, failed replication
tasks are stored in the DLQ using p.shard.GetShardID() (the local
shard ID). When source and target clusters have different shard counts
(e.g., 1:2 ratio), GetDLQReplicationMessages cannot locate these
tasks because it looks up DLQ entries by a different shard ID mapping.

Root Cause

convertTaskToDLQTask() always uses the local shard ID:

// TODO: GetShardID will break GetDLQReplicationMessages
// we need to handle DLQ for cross shard replication.
return &persistence.PutReplicationTaskToDLQRequest{
    ShardID: p.shard.GetShardID(),  // always local shard ID
    // ... remaining fields and return values omitted in this excerpt
}

This assumption breaks when source shard count != target shard count.
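
To make the failure mode concrete, here is a toy in-memory model of the mismatch. It is only a sketch: toyDLQ, dlqKey, put and get are hypothetical names, not Temporal types, and it assumes the reader looks tasks up under the source-side shard ID while the writer files them under its local shard ID.

```go
package main

import "fmt"

// dlqKey is a hypothetical key: the cluster a task came from and the
// shard ID the task was filed under.
type dlqKey struct {
	sourceCluster string
	shardID       int32
}

// toyDLQ is a hypothetical in-memory stand-in for the DLQ table.
type toyDLQ map[dlqKey][]int64

// put files a task ID under the given cluster and shard ID.
func (q toyDLQ) put(cluster string, shardID int32, taskID int64) {
	k := dlqKey{sourceCluster: cluster, shardID: shardID}
	q[k] = append(q[k], taskID)
}

// get returns the task IDs filed under the given cluster and shard ID.
func (q toyDLQ) get(cluster string, shardID int32) []int64 {
	return q[dlqKey{sourceCluster: cluster, shardID: shardID}]
}

func main() {
	q := toyDLQ{}

	// Writer side: a task that originated on source shard 1 is processed
	// on local shard 5 and filed under the local shard ID.
	q.put("source", 5, 1001)

	// Reader side (assumed): the lookup uses the source shard ID.
	fmt.Println(len(q.get("source", 1))) // 0: the task is not found
	fmt.Println(len(q.get("source", 5))) // 1: it is there, under the local ID
}
```

With equal shard counts the two IDs coincide, which would explain why the problem only shows up once the counts differ.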

Impact

  • Failed replication tasks stored in DLQ are permanently inaccessible
  • GetDLQReplicationMessages (handler.go:1536) cannot retrieve them
  • DLQ merging operations silently fail to find affected tasks
  • No error is logged — silent data loss under cross-shard replication
  • Affects all 4 task types: SYNC_ACTIVITY, HISTORY_V2,
    SYNC_WORKFLOW_STATE, SYNC_HSM

Relationship to #10224

This is related to #10224 (replication tasks not cleaned up with
mismatched shard counts): same root cause, different symptom.

Steps to Reproduce

  1. Set up two clusters with different shard counts
    • Source: history.numberOfShards = 8192
    • Target: history.numberOfShards = 16384
  2. Run workflows that generate replication failures
  3. Attempt DLQ merge via GetDLQReplicationMessages
  4. Observe: DLQ tasks not found despite being written

Suggested Fix

Similar to the fix in PR #10226, use common.MapShardID to determine
the correct target shard IDs when writing to DLQ, ensuring consistency
with how GetDLQReplicationMessages retrieves them.
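
For illustration, the shard mapping could look roughly like the sketch below. mapShardID here is a hypothetical, self-contained stand-in and not the actual common.MapShardID; it assumes 1-based shard IDs, hash-modulo shard assignment, and shard counts where one is a multiple of the other.

```go
package main

import "fmt"

// mapShardID is a hypothetical stand-in for a shard-mapping helper.
// Assumptions: shard IDs are 1-based, a workflow lands on shard
// (hash % count) + 1, and one count divides the other.
func mapShardID(sourceCount, targetCount, sourceShardID int32) ([]int32, error) {
	if sourceCount <= 0 || targetCount <= 0 {
		return nil, fmt.Errorf("shard counts must be positive")
	}
	if sourceShardID < 1 || sourceShardID > sourceCount {
		return nil, fmt.Errorf("shard ID %d out of range 1..%d", sourceShardID, sourceCount)
	}
	if sourceCount%targetCount != 0 && targetCount%sourceCount != 0 {
		return nil, fmt.Errorf("shard counts %d and %d are not multiples", sourceCount, targetCount)
	}

	id := sourceShardID - 1 // switch to zero-based for the modulo arithmetic

	// Many-to-one (or equal counts): each source shard folds into one target shard.
	if sourceCount >= targetCount {
		return []int32{id%targetCount + 1}, nil
	}

	// One-to-many: each source shard fans out to targetCount/sourceCount shards.
	ratio := targetCount / sourceCount
	out := make([]int32, 0, ratio)
	for i := int32(0); i < ratio; i++ {
		out = append(out, id+i*sourceCount+1)
	}
	return out, nil
}

func main() {
	ids, err := mapShardID(4, 8, 1)
	fmt.Println(ids, err) // [1 5] <nil>
}
```

Under these assumptions, the DLQ write path and GetDLQReplicationMessages would both derive shard IDs through the same mapping, so a task written for one side can be found from the other.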
