🐛 Bug: Alert is persisted and audited as triggered, but matching event workflow is not executed #6531


## Summary

A single interval workflow whose latest `workflowexecution.status` is stuck at `in_progress`
causes the scheduler to spam `Workflow <id> is already running by someone else` and, as a side
effect, **event-driven (alert) workflows stop being dispatched** even though alerts continue to
be ingested into the `alert` table.

In our case ~57,000 spam log lines were produced in ~32 hours and at least 77 alert events
(across multiple workflows / fingerprints) were ingested but never produced a corresponding
`workflowexecution` row.

## Environment

- Keep version: `v0.51.0`
- Deployment: docker-compose (single backend container)
- DB: PostgreSQL 15
- Redis: enabled (`REDIS=true`)
- Auth: oauth2-proxy / keycloak

## Repro

1. Create (or reuse) an interval workflow (`interval > 0`, e.g. a daily report workflow).
2. Leave its latest `workflowexecution` at `status='in_progress'`, e.g. via a backend
   container crash, OOM kill, or network blip during execution, so that the row is neither
   finished nor yet past `INTERVAL_WORKFLOWS_RELAUNCH_TIMEOUT`. In our case we don't yet know
   the exact trigger; the row simply stays `in_progress` because nothing updates it.
3. Observe backend logs:
   ```
   <ts> DEBUG keep.api.core.db: Workflow <id> is already running by someone else
   <ts> DEBUG keep.api.core.db: Workflow <id> is already running by someone else
   ... (repeats every scheduler iteration, ~1/s)
   ```
4. Send alert events to other (event-triggered) workflows. Observe that:
   - `alert` table receives the rows.
   - **No corresponding `workflowexecution` row is created for the event workflows** that
     would otherwise have matched.
5. `UPDATE workflow SET is_disabled = true WHERE id = '<stuck-id>';` — spam stops within ~1
   minute and event dispatch resumes.

## Evidence from our incident

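```sql
-- 77 alert events with no workflowexecution after them, across ~3 fingerprints.
-- An alert counts as "missed" when no alert-triggered execution started within
-- 60s after it. COUNT(DISTINCT a.id) keeps alerts_in from being inflated when
-- several executions fall inside the same window.
SELECT a.fingerprint,
       COUNT(DISTINCT a.id) AS alerts_in,
       SUM(CASE WHEN we.id IS NULL THEN 1 ELSE 0 END) AS missed
  FROM alert a
  LEFT JOIN workflowexecution we
    ON we.triggered_by LIKE 'type:alert%'
   AND we.started >= a.timestamp
   AND we.started <= a.timestamp + interval '60s'
 WHERE a.timestamp >= '2026-05-25 18:00:00'
 GROUP BY 1;
```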

(numbers from our DB; redacted)

## Root-cause analysis

The scheduler main loop is strictly serial:

https://github.com/keephq/keep/blob/<commit>/keep/workflowmanager/workflowscheduler.py#L648

```python
while not self._stop:
    self._handle_interval_workflows()
    self._handle_event_workflows()
    time.sleep(1)
```

`_handle_interval_workflows()` calls `get_workflows_that_should_run()`:

https://github.com/keephq/keep/blob/<commit>/keep/api/core/db.py#L319

For every interval workflow it does:
- `get_last_completed_execution(...)`
- if `ongoing_execution.status == 'in_progress'` and that execution hasn't yet passed
  `INTERVAL_WORKFLOWS_RELAUNCH_TIMEOUT`, it just logs `is already running by someone else`
  and `continue`s.

If `INTERVAL_WORKFLOWS_RELAUNCH_TIMEOUT` is long (or never reached because of clock/state),
the row stays `in_progress` indefinitely and **every scheduler iteration re-runs the same N SQL
statements per interval workflow**. Under our connection pool / DB latency, this stretches a
single iteration enough that `_handle_event_workflows()` falls far behind. We suspect, but have
not confirmed, that contention on the `workflows_to_run` queue or DB pool starvation then
causes event workflow dispatch to silently drop events.
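
The skip decision described above can be sketched roughly as follows. This is not Keep's actual code: `Execution`, `should_skip_interval_workflow`, the `relaunch_timeout` parameter, and the 48-hour value are hypothetical and only illustrate why a row that stays `in_progress` takes the same "log and continue" branch on every iteration until the timeout elapses.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Execution:
    # Hypothetical stand-in for a workflowexecution row.
    status: str
    started: datetime


def should_skip_interval_workflow(
    ongoing: Optional[Execution],
    now: datetime,
    relaunch_timeout: timedelta,
) -> bool:
    """Return True when the scheduler would log 'already running' and continue."""
    if ongoing is None or ongoing.status != "in_progress":
        return False  # nothing is running, so the workflow may be scheduled
    # Still inside the relaunch window: treated as owned by another runner.
    return now - ongoing.started < relaunch_timeout


started = datetime(2026, 5, 25, 18, 0, 0)
stuck = Execution(status="in_progress", started=started)
timeout = timedelta(hours=48)  # assumed value, for illustration only

# Every one-second iteration inside the window takes the same "skip" branch.
skips = sum(
    should_skip_interval_workflow(stuck, started + timedelta(seconds=s), timeout)
    for s in range(1, 6)
)
print(skips)  # prints 5
```

Only once `now - started` reaches the timeout does the function return `False`, which is the point at which a relaunch could happen in this model.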

## What we'd like to see

1. **Decouple interval and event scheduling**, e.g. run them on independent threads / tasks,
   so a slow/stuck interval workflow cannot starve event dispatch.
2. **Self-heal stuck `in_progress` rows** more aggressively (e.g. shorter default
   `INTERVAL_WORKFLOWS_RELAUNCH_TIMEOUT`, or a watchdog that finishes orphans on backend
   startup based on container restart time).
3. **Rate-limit the `is already running` log** (one per workflow per N minutes is enough).
4. Optional: expose a `/healthz` / metric that surfaces "stuck interval workflows count" so
   ops can alert on it.
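
To illustrate request 1, here is a minimal sketch of running the two loops on independent threads. It is not a patch against Keep: `interval_loop`, `event_loop`, `run_decoupled`, and the sleep durations are hypothetical stand-ins, and the real scheduler would need to handle DB sessions and shutdown more carefully.

```python
import queue
import threading
import time


def interval_loop(stop: threading.Event, slow_for: float) -> None:
    # Stand-in for interval scheduling; the sleep simulates a slow iteration.
    while not stop.is_set():
        time.sleep(slow_for)


def event_loop(stop: threading.Event, events: "queue.Queue[str]", handled: list) -> None:
    # Stand-in for event dispatch; it drains the queue on its own thread.
    while not stop.is_set():
        try:
            handled.append(events.get(timeout=0.05))
        except queue.Empty:
            continue


def run_decoupled(event_names: list, slow_for: float = 0.2, run_for: float = 0.5) -> list:
    stop = threading.Event()
    events: "queue.Queue[str]" = queue.Queue()
    handled: list = []
    threads = [
        threading.Thread(target=interval_loop, args=(stop, slow_for), daemon=True),
        threading.Thread(target=event_loop, args=(stop, events, handled), daemon=True),
    ]
    for t in threads:
        t.start()
    for name in event_names:
        events.put(name)
    time.sleep(run_for)  # let the event thread drain the queue
    stop.set()
    for t in threads:
        t.join()
    return handled


print(run_decoupled(["alert-1", "alert-2", "alert-3"]))
```

Because the event thread never waits on the interval thread, the queued events are handled even while the interval loop is blocked in its slow iteration.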

## Workaround

```sql
UPDATE workflow         SET is_disabled = true                        WHERE id = '<stuck-id>';
UPDATE workflowexecution SET is_running = 0, status='error',
                             error='manual cleanup of stuck execution'
 WHERE workflow_id = '<stuck-id>' AND is_running = 1;
```

Spam stops within ~1 minute, and event-workflow dispatch resumes immediately.

## Additional notes

- `arq` worker / `process_event_task` itself completes (we see it writing to `alert` and
  `alertdeduplicationrule`); the missing piece is the **workflow scheduler** picking the alert
  up from the in-memory `workflows_to_run` queue.
- Default deduplication rule fires normally during the incident, ruling out dedup as the
  cause of the missing dispatches.

Happy to share more logs / SQL output if helpful.