SQL transaction race in TransferStartChildExecution causes permanent workflow stall #10321

@theredspoon

Environment

  • Temporal Server: 1.31.0 (temporalio/server:1.31.0)
  • Deployment: all-in-one Docker Compose (single container, all four services)
  • PostgreSQL: DigitalOcean Managed (db-s-1vcpu-1gb) with TLS required
  • Clean install: databases dropped and recreated, schemas applied fresh

Summary

When a parent workflow task completes successfully and initiates a child workflow, the workflow task completion path races with the transfer queue processor. The transfer queue's TransferStartChildExecution task tries to update the parent workflow's mutable state to record that the child started, but by then the transaction has already been committed or rolled back by the completion path, so the update fails.

This triggers shard re-acquisition every ~5-6 seconds in a permanent loop. The child workflow gets created with only a WorkflowExecutionStarted event — no WorkflowTaskScheduled event ever gets written, so the child workflow never executes.

Reproduction

Occurs when a parent workflow task completes successfully and initiates a child workflow. Observed with temporal-sys-delete-namespace-workflow starting temporal-sys-reclaim-namespace-resources-workflow, but likely affects any parent/child workflow combination under the right timing conditions.

Observed Behavior

Timeline (from server logs)

00:53:27     Delete-namespace workflow task starts
00:53:31     Namespace renamed successfully (local activities complete)
00:53:32-35  "Workflow is busy" errors
00:53:37     First "sql: transaction has already been committed or rolled back"
00:53:38     Shard re-acquired (churn starts)

The pattern repeats every ~5-6 seconds.

Transfer Queue Error

{"level":"warn","ts":"2026-05-19T00:53:37.427Z",
 "msg":"Fail to process task",
 "component":"transfer-queue-processor",
 "queue-task-type":"TransferStartChildExecution",
 "queue-task":"StartChildExecutionTask{...TargetWorkflowID: temporal-sys-reclaim-namespace-resources-workflow/smoke-test-deleted-ba6f9, InitiatedEventID: 11...}",
 "error":"UpdateWorkflowExecution operation failed. Failed to commit transaction. Error: sql: transaction has already been committed or rolled back",
 "error-type":"serviceerror.Unavailable",
 "attempt":2}

Concurrent Access Indicators

Before the SQL error, logs show:

{"level":"info","msg":"history client encountered error",
 "error":"Workflow is busy.",
 "service-error-type":"serviceerror.ResourceExhausted"}

This indicates concurrent attempts to access the parent workflow's mutable state.

Error Pattern

Transfer task retries alternate between two errors:

  1. sql: transaction has already been committed or rolled back (primary)
  2. context deadline exceeded (when queue processor times out)

Each failure triggers shard re-acquisition, but the churn doesn't clear the broken state — it's a symptom, not the cause.
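
To make it easier to tally the two errors when scanning server logs, here is a minimal Go sketch that classifies a JSON log line by its error text. It assumes only the field names visible in the log entry above (`queue-task-type`, `error`); the `taskLog` type, the `classify` function, and the labels `tx-done`, `deadline`, and `other` are hypothetical names for illustration, not part of Temporal.

```go
package main

import (
	"context"
	"database/sql"
	"encoding/json"
	"fmt"
	"strings"
)

// taskLog mirrors only the two log fields used for classification.
type taskLog struct {
	TaskType string `json:"queue-task-type"`
	Error    string `json:"error"`
}

// classify labels a JSON log line (labels are hypothetical):
//   "tx-done"  - TransferStartChildExecution failed with the sql.ErrTxDone message
//   "deadline" - TransferStartChildExecution failed with "context deadline exceeded"
//   "other"    - anything else
func classify(line string) (string, error) {
	var entry taskLog
	if err := json.Unmarshal([]byte(line), &entry); err != nil {
		return "", err
	}
	if entry.TaskType != "TransferStartChildExecution" {
		return "other", nil
	}
	switch {
	case strings.Contains(entry.Error, sql.ErrTxDone.Error()):
		return "tx-done", nil
	case strings.Contains(entry.Error, context.DeadlineExceeded.Error()):
		return "deadline", nil
	default:
		return "other", nil
	}
}

func main() {
	// Abbreviated copy of the transfer queue log entry quoted above.
	line := `{"level":"warn","msg":"Fail to process task",` +
		`"queue-task-type":"TransferStartChildExecution",` +
		`"error":"UpdateWorkflowExecution operation failed. Failed to commit transaction. ` +
		`Error: sql: transaction has already been committed or rolled back",` +
		`"attempt":2}`
	label, err := classify(line)
	fmt.Println(label, err)
}
```

The match is on error text because the log shows the failure surfaced as serviceerror.Unavailable, so only the message string is available at this level.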

Child Workflow State

The child workflow (temporal-sys-reclaim-namespace-resources-workflow) gets created but stalls with only 1 event in its history:

ID  Time                     Type
 1  2026-05-19T00:53:31Z     WorkflowExecutionStarted

No WorkflowTaskScheduled event is ever written, so the child workflow never executes.

Root Cause Analysis

Race condition in history service:

  1. Parent workflow task completes, writes event 11 (StartChildWorkflowExecutionInitiated)
  2. Workflow task completion path has an open transaction on parent workflow's mutable state
  3. Transfer queue processor immediately picks up TransferStartChildExecution task
  4. Transfer task calls UpdateWorkflowExecution on parent to record child started
  5. By the time the transfer task's transaction executes, the completion path's transaction is already committed/rolled back
  6. UpdateWorkflowExecution fails with "sql: transaction has already been committed or rolled back"
  7. Error triggers shard re-acquisition
  8. Shard re-acquisition doesn't clear the broken state — task retries indefinitely

Code path: transfer_queue_active_task_executor.go:1090 (processStartChildExecution) → recordChildExecutionStarted → UpdateWorkflowExecution → SQL commit fails

Impact

  • Child workflows fail to execute when parent workflow tasks complete and start children
  • Observed with temporal operator namespace delete (starts reclaim-resources child)
  • Likely affects any workflow pattern that starts child workflows
  • Creates permanent shard churn (re-acquisition every ~5-6s)

Relationship to Other Issues

This bug was hidden by #10320 (first workflow task hang) in most test runs. It only manifested when the parent workflow task completed successfully, which occurred during active shard churn that accidentally cleared the matching service's stale state.
