Summary
A single interval workflow whose latest workflowexecution.status is stuck at in_progress
causes the scheduler to spam "Workflow <id> is already running by someone else" and, as a side
effect, event-driven (alert) workflows stop being dispatched even though alerts continue to
be ingested into the alert table.
In our case ~57,000 spam log lines were produced in ~32 hours and at least 77 alert events
(across multiple workflows / fingerprints) were ingested but never produced a corresponding
workflowexecution row.
Environment
- Keep version: v0.51.0
- Deployment: docker-compose (single backend container)
- DB: PostgreSQL 15
- Redis: enabled (REDIS=true)
- Auth: oauth2-proxy / keycloak
Repro
- Create (or already have) an interval workflow (interval > 0, e.g. a daily report workflow).
- Cause its latest workflowexecution to be left at status='in_progress' (e.g. backend
container crash / OOM kill / network blip during execution, before INTERVAL_WORKFLOWS_RELAUNCH_TIMEOUT
elapses, or any path that leaves the row not finished and not timed out yet). In our case
we don't yet know the exact crash trigger, but the row simply stays in_progress forever
because nothing updates it.
- Observe backend logs:
<ts> DEBUG keep.api.core.db: Workflow <id> is already running by someone else
<ts> DEBUG keep.api.core.db: Workflow <id> is already running by someone else
... (repeats every scheduler iteration, ~1/s)
- Send alert events to other (event-triggered) workflows. Observe that:
  - The alert table receives the rows.
  - No corresponding workflowexecution row is created for the event workflows that
    would otherwise have matched.
- Run UPDATE workflow SET is_disabled = true WHERE id = '<stuck-id>'; spam then stops within ~1
minute and event dispatch resumes.
Evidence from our incident
-- 77 alert events with no workflowexecution after them, across ~3 fingerprints
SELECT a.fingerprint,
       COUNT(*) AS alerts_in,
       SUM(CASE WHEN we.id IS NULL THEN 1 ELSE 0 END) AS missed
FROM alert a
LEFT JOIN workflowexecution we
  ON we.triggered_by LIKE 'type:alert%'
 AND we.started >= a.timestamp
 AND we.started <= a.timestamp + interval '60s'
WHERE a.timestamp >= '2026-05-25 18:00:00'
GROUP BY 1;
(numbers from our DB; redacted)
Root-cause analysis
The scheduler main loop is strictly serial:
https://github.com/keephq/keep/blob//keep/workflowmanager/workflowscheduler.py#L648
while not self._stop:
    self._handle_interval_workflows()
    self._handle_event_workflows()
    time.sleep(1)
_handle_interval_workflows() calls get_workflows_that_should_run():
https://github.com/keephq/keep/blob//keep/api/core/db.py#L319
For every interval workflow it does:
- get_last_completed_execution(...)
- if there's an ongoing_execution.status == 'in_progress' and it hasn't passed
INTERVAL_WORKFLOWS_RELAUNCH_TIMEOUT yet, it just logs "is already running by someone else"
and continues.
If INTERVAL_WORKFLOWS_RELAUNCH_TIMEOUT is long (or never reached because of clock/state),
the row stays in_progress forever and every scheduler iteration re-runs the same N SQL
statements per interval workflow. Under our connection pool / DB latency, this stretches a
single iteration enough that _handle_event_workflows() falls far behind, and (we suspect)
the workflows_to_run queue contention or DB pool starvation causes event workflow dispatch
to silently drop events.
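To make the per-workflow decision described above concrete, here is a minimal Python sketch. It is not Keep's code: the Execution dataclass, the decide() function and its return labels are hypothetical names invented for illustration, and the interval-elapsed check is left out. It only shows why a row that stays in_progress and is younger than the relaunch timeout is skipped on every scheduler pass.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Execution:
    """Hypothetical stand-in for a workflowexecution row."""
    status: str
    started: datetime


def decide(ongoing: Optional[Execution], now: datetime,
           relaunch_timeout: timedelta) -> str:
    """Return what a scheduler pass would do for one interval workflow.

    Assumption: labels are illustrative, not values used by Keep.
    """
    if ongoing is None or ongoing.status != "in_progress":
        # Nothing is running; the workflow may run (interval check omitted).
        return "eligible"
    if now - ongoing.started < relaunch_timeout:
        # This is the branch that logs "is already running by someone else".
        return "skip"
    # The execution outlived the timeout and is treated as orphaned.
    return "relaunch"


if __name__ == "__main__":
    now = datetime(2026, 5, 26, tzinfo=timezone.utc)
    stuck = Execution("in_progress", now - timedelta(minutes=5))
    print(decide(stuck, now, timedelta(minutes=60)))
```

With a long timeout the "skip" branch is taken on every iteration until the timeout finally elapses, which matches the repeating log line in the repro.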
What we'd like to see
- Decouple interval and event scheduling, e.g. run them on independent threads / tasks,
so a slow/stuck interval workflow cannot starve event dispatch.
- Self-heal stuck in_progress rows more aggressively (e.g. shorter default
INTERVAL_WORKFLOWS_RELAUNCH_TIMEOUT, or a watchdog that finishes orphans on backend
startup based on container restart time).
- Rate-limit the "is already running" log (one per workflow per N minutes is enough).
- Optional: expose a /healthz / metric that surfaces "stuck interval workflows count" so
ops can alert on it.
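As a rough illustration of the first request (decoupled scheduling), the sketch below runs two independent loops on separate threads. It is a toy model under stated assumptions, not a patch for workflowscheduler.py: run_loop(), demo() and the two handler functions are hypothetical, and the slow interval handler is simulated with time.sleep().

```python
import threading
import time


def run_loop(handler, stop: threading.Event, period: float) -> None:
    """Call handler repeatedly until stop is set, pausing period seconds."""
    while not stop.is_set():
        handler()
        stop.wait(period)


def demo(duration: float = 0.5) -> dict:
    """Run a slow 'interval' loop and a fast 'event' loop side by side."""
    stop = threading.Event()
    counts = {"interval": 0, "event": 0}

    def slow_interval_pass() -> None:
        time.sleep(0.2)  # simulates a pass slowed down by a stuck workflow
        counts["interval"] += 1

    def fast_event_pass() -> None:
        counts["event"] += 1  # simulates draining the event queue

    threads = [
        threading.Thread(target=run_loop, args=(h, stop, 0.01), daemon=True)
        for h in (slow_interval_pass, fast_event_pass)
    ]
    for t in threads:
        t.start()
    time.sleep(duration)
    stop.set()
    for t in threads:
        t.join()
    return counts


if __name__ == "__main__":
    print(demo())
```

Because each loop has its own thread, the event loop keeps its own cadence no matter how long an interval pass takes. In the real scheduler the two handlers would still share a DB connection pool, so pool starvation would need to be addressed separately.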
Workaround
UPDATE workflow SET is_disabled = true WHERE id = '<stuck-id>';
UPDATE workflowexecution SET is_running = 0, status='error',
error='manual cleanup of stuck execution'
WHERE workflow_id = '<stuck-id>' AND is_running = 1;
Spam stops within ~1 minute, event-workflow dispatch resumes immediately.
Additional notes
- The arq worker / process_event_task itself completes (we see it writing to alert and
alertdeduplicationrule); the missing piece is the workflow scheduler picking the alert
up from the in-memory workflows_to_run queue.
- Default deduplication rule fires normally during the incident, ruling out dedup as the
cause of the missing dispatches.
Happy to share more logs / SQL output if helpful.