🐛 Bug: Alert is persisted and audited as triggered, but matching event workflow is not executed #6531

@haonanan

Summary

A single interval workflow whose latest workflowexecution.status is stuck at in_progress
causes the scheduler to spam "Workflow <id> is already running by someone else" and, as a
side effect, event-driven (alert) workflows stop being dispatched even though alerts continue
to be ingested into the alert table.

In our case ~57,000 spam log lines were produced in ~32 hours and at least 77 alert events
(across multiple workflows / fingerprints) were ingested but never produced a corresponding
workflowexecution row.

Environment

  • Keep version: v0.51.0
  • Deployment: docker-compose (single backend container)
  • DB: PostgreSQL 15
  • Redis: enabled (REDIS=true)
  • Auth: oauth2-proxy / keycloak

Repro

  1. Create an interval workflow, or use an existing one (interval > 0, e.g. a daily report
    workflow).
  2. Leave its latest workflowexecution at status='in_progress', e.g. through a backend
    container crash, OOM kill, or network blip during execution, or any other path that
    leaves the row neither finished nor timed out (INTERVAL_WORKFLOWS_RELAUNCH_TIMEOUT has
    not yet elapsed). In our case we don't yet know the exact trigger; the row simply stays
    in_progress forever because nothing updates it.
  3. Observe backend logs:
    <ts> DEBUG keep.api.core.db: Workflow <id> is already running by someone else
    <ts> DEBUG keep.api.core.db: Workflow <id> is already running by someone else
    ... (repeats every scheduler iteration, ~1/s)
    
  4. Send alert events to other (event-triggered) workflows. Observe that:
    • alert table receives the rows.
    • No corresponding workflowexecution row is created for the event workflows that
      would otherwise have matched.
  5. Run UPDATE workflow SET is_disabled = true WHERE id = '<stuck-id>'; the spam stops
    within ~1 minute and event dispatch resumes.

Evidence from our incident

-- 77 alert events with no workflowexecution after them, across ~3 fingerprints.
-- DISTINCT keeps an alert that matches several executions in the 60 s window
-- from being counted more than once (assumes alert.id is unique).
SELECT a.fingerprint,
       COUNT(DISTINCT a.id) AS alerts_in,
       COUNT(DISTINCT a.id) FILTER (WHERE we.id IS NULL) AS missed
  FROM alert a
  LEFT JOIN workflowexecution we
    ON we.triggered_by LIKE 'type:alert%'
   AND we.started >= a.timestamp
   AND we.started <= a.timestamp + interval '60s'
 WHERE a.timestamp >= '2026-05-25 18:00:00'
 GROUP BY 1;

(numbers from our DB; redacted)

Root-cause analysis

The scheduler main loop is strictly serial:

https://github.com/keephq/keep/blob//keep/workflowmanager/workflowscheduler.py#L648

while not self._stop:
    self._handle_interval_workflows()
    self._handle_event_workflows()
    time.sleep(1)

_handle_interval_workflows() calls get_workflows_that_should_run():

https://github.com/keephq/keep/blob//keep/api/core/db.py#L319

For every interval workflow it does:

  • get_last_completed_execution(...)
  • if there is an ongoing execution (ongoing_execution.status == 'in_progress') that has
    not yet passed INTERVAL_WORKFLOWS_RELAUNCH_TIMEOUT, it just logs "is already running by
    someone else" and continues.
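
The check described above can be sketched as a pure function. This is a minimal illustration, not Keep's actual code: should_skip_interval_workflow, its parameters, and the 60-minute timeout value are hypothetical. Only the in_progress status and the INTERVAL_WORKFLOWS_RELAUNCH_TIMEOUT name come from this report.

```python
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical value for illustration; the real default is not stated in this report.
INTERVAL_WORKFLOWS_RELAUNCH_TIMEOUT = timedelta(minutes=60)


def should_skip_interval_workflow(
    ongoing_status: Optional[str],
    ongoing_started: Optional[datetime],
    now: datetime,
    timeout: timedelta = INTERVAL_WORKFLOWS_RELAUNCH_TIMEOUT,
) -> bool:
    """Return True when the workflow is treated as 'already running by someone else'."""
    if ongoing_status != "in_progress" or ongoing_started is None:
        return False  # nothing is running, so the workflow may be scheduled
    # Still inside the relaunch window: skip (the scheduler logs the message here).
    return now - ongoing_started < timeout
```

With a row that started 10 minutes ago and a 60-minute timeout, this returns True on every scheduler iteration until the timeout passes, which matches the repeated log line seen in the repro.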

If INTERVAL_WORKFLOWS_RELAUNCH_TIMEOUT is long (or never reached because of clock/state),
the row stays in_progress forever and every scheduler iteration re-runs the same N SQL
statements per interval workflow. Under our connection pool / DB latency, this stretches a
single iteration enough that _handle_event_workflows() falls far behind, and (we suspect)
contention on the workflows_to_run queue or DB pool starvation causes event workflow
dispatch to silently drop events.
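
As a rough consistency check on the "stretched iteration" hypothesis, the log volume from the Summary can be converted into an average loop period. This sketch assumes exactly one stuck workflow emitting one log line per scheduler iteration, and it treats the approximate figures (~57,000 lines, ~32 hours) as exact. Both are assumptions, so the result is only an order-of-magnitude estimate.

```python
LOG_LINES = 57_000      # approximate count from the Summary
WINDOW_HOURS = 32       # approximate duration from the Summary
SLEEP_SECONDS = 1.0     # time.sleep(1) in the scheduler loop

window_seconds = WINDOW_HOURS * 3600
# Average seconds between two consecutive log lines, i.e. per loop iteration
# under the one-line-per-iteration assumption.
avg_period = window_seconds / LOG_LINES
# Average time per iteration that the sleep itself does not account for.
avg_work = avg_period - SLEEP_SECONDS

print(f"average iteration period: {avg_period:.2f} s")  # 2.02 s
print(f"average work per iteration: {avg_work:.2f} s")   # 1.02 s
```

Under those assumptions the loop averaged about 2 s per iteration, roughly half of it spent outside the sleep. That is consistent with iterations being stretched, but it does not by itself explain dropped events.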

What we'd like to see

  1. Decouple interval and event scheduling, e.g. run them on independent threads / tasks,
    so a slow/stuck interval workflow cannot starve event dispatch.
  2. Self-heal stuck in_progress rows more aggressively (e.g. shorter default
    INTERVAL_WORKFLOWS_RELAUNCH_TIMEOUT, or a watchdog that finishes orphans on backend
    startup based on container restart time).
  3. Rate-limit the "is already running" log (one per workflow per N minutes is enough).
  4. Optional: expose a /healthz endpoint or a metric that surfaces "stuck interval workflows
    count" so ops can alert on it.
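
Item 1 could look roughly like the following. It is a minimal sketch of running two handlers on independent threads, not a patch against Keep: start_loop, demo, and the two stand-in handlers are hypothetical names, and the sleep durations are chosen only to make the effect visible.

```python
import threading
import time
from typing import Callable, List


def start_loop(name: str, handler: Callable[[], None], stop: threading.Event,
               period: float = 1.0) -> threading.Thread:
    """Run `handler` repeatedly on its own daemon thread until `stop` is set."""
    def _loop() -> None:
        while not stop.is_set():
            handler()
            stop.wait(period)  # like sleep, but returns early on shutdown

    thread = threading.Thread(target=_loop, name=name, daemon=True)
    thread.start()
    return thread


def demo(duration: float = 0.6) -> int:
    """Return how often the event handler ran while the interval handler was slow."""
    stop = threading.Event()
    event_runs: List[float] = []

    def slow_interval_handler() -> None:
        time.sleep(0.5)  # stand-in for a slow or stuck interval pass

    def event_handler() -> None:
        event_runs.append(time.monotonic())

    threads = [
        start_loop("interval", slow_interval_handler, stop, period=0.05),
        start_loop("event", event_handler, stop, period=0.05),
    ]
    time.sleep(duration)
    stop.set()
    for t in threads:
        t.join()
    return len(event_runs)
```

In a serial loop the event handler would run at most about once per slow interval pass; here it keeps its own cadence. A real scheduler would additionally need per-loop error handling and separate DB sessions, which this sketch leaves out.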

Workaround

UPDATE workflow SET is_disabled = true WHERE id = '<stuck-id>';
UPDATE workflowexecution
   SET is_running = 0, status = 'error', error = 'manual cleanup of stuck execution'
 WHERE workflow_id = '<stuck-id>' AND is_running = 1;

The spam stops within ~1 minute and event-workflow dispatch resumes immediately.
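
The manual cleanup could be automated as a startup watchdog, along the lines of item 2 in the list of requests. The sketch below uses an in-memory SQLite table with a simplified, hypothetical schema purely for illustration. fail_orphaned_executions is not an existing Keep function, and the real workflowexecution table has more columns.

```python
import sqlite3
from datetime import datetime, timedelta


def fail_orphaned_executions(conn: sqlite3.Connection, started_before: datetime) -> int:
    """Mark in_progress executions started before `started_before` as errored.

    Returns the number of rows updated.
    """
    cur = conn.execute(
        """
        UPDATE workflowexecution
           SET is_running = 0,
               status = 'error',
               error = 'orphaned execution cleaned up at startup'
         WHERE status = 'in_progress'
           AND started < ?
        """,
        (started_before.isoformat(),),
    )
    conn.commit()
    return cur.rowcount


def demo() -> list:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE workflowexecution ("
        "id TEXT PRIMARY KEY, workflow_id TEXT, status TEXT, "
        "is_running INTEGER, started TEXT, error TEXT)"
    )
    boot = datetime(2026, 5, 25, 18, 0, 0)  # pretend backend start time
    rows = [
        ("e1", "wf-stuck", "in_progress", 1, (boot - timedelta(hours=5)).isoformat(), None),
        ("e2", "wf-ok", "success", 0, (boot - timedelta(hours=1)).isoformat(), None),
        ("e3", "wf-new", "in_progress", 1, (boot + timedelta(seconds=5)).isoformat(), None),
    ]
    conn.executemany("INSERT INTO workflowexecution VALUES (?, ?, ?, ?, ?, ?)", rows)
    updated = fail_orphaned_executions(conn, started_before=boot)
    statuses = conn.execute(
        "SELECT id, status FROM workflowexecution ORDER BY id"
    ).fetchall()
    return [updated, statuses]
```

Treating everything that started before boot as orphaned is only safe for a single-backend deployment like the one in this report. With several workers, an execution started before one container's restart may still be running elsewhere.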

Additional notes

  • arq worker / process_event_task itself completes (we see it writing to alert and
    alertdeduplicationrule); the missing piece is the workflow scheduler picking the alert
    up from the in-memory workflows_to_run queue.
  • Default deduplication rule fires normally during the incident, ruling out dedup as the
    cause of the missing dispatches.

Happy to share more logs / SQL output if helpful.

