Environment
- Temporal Server: 1.31.0 (temporalio/server:1.31.0)
- Deployment: all-in-one Docker Compose (single container, all four services)
- PostgreSQL: DigitalOcean Managed (db-s-1vcpu-1gb) with TLS required
- Clean install: databases dropped and recreated, schemas applied fresh
Summary
When a parent workflow task completes successfully and initiates a child workflow, the workflow task completion path races with the transfer queue processor. The transfer queue's TransferStartChildExecution task tries to update the parent workflow's mutable state to record that the child started, but by then the completion path has already committed or rolled back the transaction.
This triggers shard re-acquisition every ~5-6 seconds in a permanent loop. The child workflow gets created with only a WorkflowExecutionStarted event — no WorkflowTaskScheduled event ever gets written, so the child workflow never executes.
Reproduction
Occurs when a parent workflow task completes successfully and initiates a child workflow. Observed with temporal-sys-delete-namespace-workflow starting temporal-sys-reclaim-namespace-resources-workflow, but likely affects any parent/child workflow combination under the right timing conditions.
Observed Behavior
Timeline (from server logs)
00:53:27 Delete-namespace workflow task starts
00:53:31 Namespace renamed successfully (local activities complete)
00:53:32-35 "Workflow is busy" errors
00:53:37 First "sql: transaction has already been committed or rolled back"
00:53:38 Shard re-acquired (churn starts)
The pattern repeats every ~5-6 seconds.
Transfer Queue Error
{"level":"warn","ts":"2026-05-19T00:53:37.427Z",
"msg":"Fail to process task",
"component":"transfer-queue-processor",
"queue-task-type":"TransferStartChildExecution",
"queue-task":"StartChildExecutionTask{...TargetWorkflowID: temporal-sys-reclaim-namespace-resources-workflow/smoke-test-deleted-ba6f9, InitiatedEventID: 11...}",
"error":"UpdateWorkflowExecution operation failed. Failed to commit transaction. Error: sql: transaction has already been committed or rolled back",
"error-type":"serviceerror.Unavailable",
"attempt":2}
Concurrent Access Indicators
Before the SQL error, logs show:
{"level":"info","msg":"history client encountered error",
"error":"Workflow is busy.",
"service-error-type":"serviceerror.ResourceExhausted"}
This indicates concurrent attempts to access the parent workflow's mutable state.
Error Pattern
Transfer task retries alternate between two errors:
- sql: transaction has already been committed or rolled back (primary)
- context deadline exceeded (when the queue processor times out)
Each failure triggers shard re-acquisition, but the churn doesn't clear the broken state — it's a symptom, not the cause.
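To check whether a log capture shows the same alternation, the two errors can be tallied with a small script. This is a minimal sketch, not part of Temporal: the function names are hypothetical, and it assumes each log entry is a single-line JSON object (the entry quoted above is wrapped for readability). The field names and error substrings are taken from the log lines in this report.

```python
import json
from collections import Counter
from typing import Iterable, Optional

# Substrings copied from the error messages quoted in this report.
TX_DONE = "sql: transaction has already been committed or rolled back"
DEADLINE = "context deadline exceeded"


def classify_transfer_failure(line: str) -> Optional[str]:
    """Classify one raw log line (hypothetical helper).

    Returns "tx-done", "deadline", or "other" for an entry whose
    queue-task-type is TransferStartChildExecution, and None for
    anything else, including lines that are not JSON objects.
    """
    try:
        entry = json.loads(line)
    except json.JSONDecodeError:
        return None
    if not isinstance(entry, dict):
        return None
    if entry.get("queue-task-type") != "TransferStartChildExecution":
        return None
    error = str(entry.get("error", ""))
    if TX_DONE in error:
        return "tx-done"
    if DEADLINE in error:
        return "deadline"
    return "other"


def summarize(lines: Iterable[str]) -> Counter:
    """Count failure kinds across an iterable of raw log lines."""
    kinds = (classify_transfer_failure(line) for line in lines)
    return Counter(kind for kind in kinds if kind is not None)
```

Feeding the server log through `summarize` should show "tx-done" dominating with occasional "deadline" entries if the capture matches the pattern described here.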
Child Workflow State
The child workflow (temporal-sys-reclaim-namespace-resources-workflow) gets created but stalls with only 1 event in its history:
ID Time Type
1 2026-05-19T00:53:31Z WorkflowExecutionStarted
No WorkflowTaskScheduled event is ever written, so the child workflow never executes.
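The stalled state can be expressed as a simple predicate over the child's event types. The sketch below is illustrative only: `is_stalled_child` is a hypothetical helper, and it assumes the history has already been fetched and reduced to an ordered list of event type names by whatever means is available.

```python
from typing import Sequence


def is_stalled_child(event_types: Sequence[str]) -> bool:
    """Return True if the history starts with WorkflowExecutionStarted
    and contains no WorkflowTaskScheduled event (hypothetical helper)."""
    return bool(
        len(event_types) > 0
        and event_types[0] == "WorkflowExecutionStarted"
        and "WorkflowTaskScheduled" not in event_types
    )


# The single-event history shown above matches the stalled pattern.
print(is_stalled_child(["WorkflowExecutionStarted"]))
```

A healthy child would have WorkflowTaskScheduled as its second event, so the predicate returns False for it.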
Root Cause Analysis
Race condition in history service:
- Parent workflow task completes, writes event 11 (StartChildWorkflowExecutionInitiated)
- Workflow task completion path has an open transaction on parent workflow's mutable state
- Transfer queue processor immediately picks up TransferStartChildExecution task
- Transfer task calls UpdateWorkflowExecution on parent to record child started
- By the time the transfer task's transaction executes, the completion path's transaction is already committed/rolled back
- UpdateWorkflowExecution fails with "transaction already committed or rolled back"
- Error triggers shard re-acquisition
- Shard re-acquisition doesn't clear the broken state — task retries indefinitely
Code path: transfer_queue_active_task_executor.go:1090 (processStartChildExecution) → recordChildExecutionStarted → UpdateWorkflowExecution → SQL commit fails
Impact
- Child workflows fail to execute when parent workflow tasks complete and start children
- Observed with temporal operator namespace delete (starts reclaim-resources child)
- Likely affects any workflow pattern that starts child workflows
- Creates permanent shard churn (re-acquisition every ~5-6s)
Relationship to Other Issues
This bug was hidden by #10320 (first workflow task hang) in most test runs. It only manifested when the parent workflow task completed successfully, which occurred during active shard churn that accidentally cleared the matching service's stale state.