Test: handle analysis/annotation worker OOM + crash auto-restart (perma-fail, memory cap, pause brake)

🤖 Written by Claude

## Background

The analysis worker dispatcher auto-restarts terminated node jobs. A node whose query is large enough to crash the worker host (e.g. `list(qs.values_list("pk", flat=True))` on 7M variants) would be re-dispatched once the box recovered, crashing it again on every cycle. Statement timeouts didn't catch this: the query itself streams fine, and it is the **worker** that runs out of memory while materialising the list. RDS stayed up throughout.

This issue tracks **testing** of the mitigations added for this. The underlying `list(...)` bug is already fixed separately.

## Changes to test

**1. Distinguish worker-died-mid-load from benign worker death** (`analysis/tasks/analysis_update_tasks.py`, `lease_ready_nodes`)
- A lapsed lease on a node that committed `LOADING`/`LOADING_CACHE` (a worker started executing it and never returned) is **perma-failed** (`ERROR`, not re-dispatched) + reported to Rollbar via `report_message`.
- A lapsed lease on a `QUEUED` node (worker died before starting it — deploy/restart) still re-leases as before (self-heal preserved).
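
The lapsed-lease decision described above can be sketched as a pure function. This is a minimal illustration, not the code in `lease_ready_nodes`: the `NodeStatus` and `LeaseAction` enums and the `classify_lease` name are hypothetical stand-ins, and the real implementation works on database rows rather than plain values.

```python
from datetime import datetime
from enum import Enum


class NodeStatus(Enum):
    # Hypothetical stand-in for the node status values named in this issue.
    QUEUED = "QUEUED"
    LOADING = "LOADING"
    LOADING_CACHE = "LOADING_CACHE"
    ERROR = "ERROR"


class LeaseAction(Enum):
    KEEP = "keep"              # lease still valid (or node not leasable): leave it alone
    RE_LEASE = "re-lease"      # benign worker death: dispatch the node again
    PERMA_FAIL = "perma-fail"  # worker died mid-load: set ERROR and report


# Statuses that prove a worker actually started executing the node.
STARTED_STATUSES = frozenset({NodeStatus.LOADING, NodeStatus.LOADING_CACHE})


def classify_lease(status: NodeStatus, lease_expires: datetime, now: datetime) -> LeaseAction:
    """Decide what the dispatcher should do with a leased node."""
    if lease_expires > now:
        return LeaseAction.KEEP
    if status in STARTED_STATUSES:
        # The worker committed a "started" status and never came back.
        return LeaseAction.PERMA_FAIL
    if status is NodeStatus.QUEUED:
        # The worker died before touching the node (deploy/restart).
        return LeaseAction.RE_LEASE
    return LeaseAction.KEEP
```

Keeping the decision separate from the query makes the two test-checklist cases (kill mid-load versus kill while `QUEUED`) easy to reason about without a running worker.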

**2. Worker memory cap → catch OOM in Python** (`variantgrid/celery.py`, `worker_process_init`)
- `RLIMIT_AS` set per-queue via `CELERY_WORKER_ADDRESS_SPACE_LIMIT_GB` (dict), applied **only to workers dedicated to capped queues** (so VEP bulk inserts on `annotation_workers` are never throttled). Default `{"analysis_workers": 8}`.
- A runaway allocation now raises a catchable `MemoryError`; `update_node_task` catches it, perma-fails the node (`ERROR`, terminal), and raises `NodeOutOfMemoryException` so the `task_failure` handler reports it to Rollbar with analysis/node context.
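
A rough sketch of the memory-cap mechanism, assuming a Unix host. The helper names (`address_space_limit_bytes`, `apply_address_space_limit`, `run_node_load`) are hypothetical, the exception class is a local stand-in for the real `NodeOutOfMemoryException`, and taking the largest cap when a worker serves several capped queues is an assumption rather than something this issue specifies. In the real code the cap would be applied from the Celery `worker_process_init` handler.

```python
import resource

GIB = 1024 ** 3

# Mirrors the default quoted above; the real value comes from settings.
CELERY_WORKER_ADDRESS_SPACE_LIMIT_GB = {"analysis_workers": 8}


class NodeOutOfMemoryException(Exception):
    """Local stand-in for the project's exception of the same name."""


def address_space_limit_bytes(worker_queues, limits_gb):
    """Return the cap in bytes, or None if this worker must stay uncapped.

    A worker is capped only when every queue it consumes has a configured
    limit, so a combined worker that also serves an uncapped queue is skipped.
    """
    queues = list(worker_queues)
    if not queues or any(queue not in limits_gb for queue in queues):
        return None
    # Assumption: with several capped queues, use the most generous cap.
    return int(max(limits_gb[queue] for queue in queues) * GIB)


def apply_address_space_limit(limit_bytes):
    """Lower the soft RLIMIT_AS for the current process, keeping the hard limit."""
    _soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    if hard != resource.RLIM_INFINITY:
        limit_bytes = min(limit_bytes, hard)
    resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, hard))


def run_node_load(load, mark_error):
    """Run a node load; on MemoryError mark the node failed and re-raise."""
    try:
        return load()
    except MemoryError as exc:
        mark_error()  # hypothetical callback that sets the node to ERROR
        raise NodeOutOfMemoryException("node load exceeded the memory cap") from exc
```

Only the soft limit is lowered here, so the process could raise it again if needed. Whether the real handler also touches the hard limit is not stated in this issue.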

**3. Global pause brake** (`snpdb.models.JobsControl` singleton, migration `0189_jobscontrol`)
- When paused, `lease_ready_nodes`, `reschedule_stalled_analyses`, and annotation `dispatch_annotation_runs` no-op (in-flight work untouched; resume picks it back up).
- Management command: `manage.py jobs_control {pause|resume|status}` (`--reason` on pause).
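
The "no-op while paused" behaviour can be pictured as a guard in front of each dispatcher entry point. The sketch below uses an in-memory `JobsControlState` object as a hypothetical stand-in for the database-backed `JobsControl` singleton, and the `skip_when_paused` decorator is illustrative only; the real dispatchers may simply check the flag inline.

```python
from dataclasses import dataclass
from functools import wraps


@dataclass
class JobsControlState:
    # In-memory stand-in for the JobsControl singleton row.
    paused: bool = False
    reason: str = ""

    def pause(self, reason: str = "") -> None:
        self.paused = True
        self.reason = reason

    def resume(self) -> None:
        self.paused = False
        self.reason = ""


def skip_when_paused(control: JobsControlState):
    """Decorator: make a dispatcher function return None while paused."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if control.paused:
                # Nothing is cancelled; queued work just waits for resume.
                return None
            return func(*args, **kwargs)
        return wrapper
    return decorator


CONTROL = JobsControlState()


@skip_when_paused(CONTROL)
def lease_ready_nodes_sketch(ready_node_ids):
    """Pretend to lease every ready node and return the leased ids."""
    return list(ready_node_ids)
```

Because the guard only skips dispatch, in-flight work is untouched and the next dispatcher run after a resume sees the same ready nodes.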

**4. Auto-pause after host reboot** (`snpdb/signals/jobs_autopause.py`, `worker_ready`)
- On long-lived hosts, a low `/proc/uptime` means the box has just rebooted (a normal deploy leaves uptime high). The pause fires once per boot, keyed on the `btime` value in `/proc/stat`, so concurrent workers and worker restarts don't re-trip it, and an admin resume is not undone by a later worker start in the same boot.
- Controlled by `JOBS_AUTOPAUSE_ON_REBOOT` (default on) / `JOBS_AUTOPAUSE_ON_REBOOT_UPTIME_SECS` (600). Turn off on ephemeral/autoscaled hosts.
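
The reboot check can be sketched as below, assuming a Linux host. The function names and the `last_paused_btime` argument (the boot time recorded the last time an auto-pause fired) are hypothetical; only the `/proc/uptime` and `/proc/stat` inputs and the 600-second default come from this issue.

```python
def parse_uptime_secs(proc_uptime_text: str) -> float:
    """/proc/uptime holds two floats; the first is seconds since boot."""
    return float(proc_uptime_text.split()[0])


def parse_btime(proc_stat_text: str) -> int:
    """Return the boot time (Unix epoch seconds) from the 'btime' line."""
    for line in proc_stat_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "btime":
            return int(fields[1])
    raise ValueError("no btime line found in /proc/stat content")


def should_autopause(uptime_secs, btime, last_paused_btime,
                     enabled=True, threshold_secs=600):
    """True only for the first worker start after a fresh reboot."""
    if not enabled:
        return False
    if uptime_secs >= threshold_secs:
        # Host has been up a while: this is a deploy or worker restart.
        return False
    # Same btime as the last auto-pause means this boot was already handled.
    return btime != last_paused_btime


def read_boot_facts():
    """Read the live values; works only where /proc is available."""
    with open("/proc/uptime") as uptime_file:
        uptime_secs = parse_uptime_secs(uptime_file.read())
    with open("/proc/stat") as stat_file:
        btime = parse_btime(stat_file.read())
    return uptime_secs, btime
```

Keying on `btime` rather than on a timestamp is what lets several workers start at once without pausing more than once per boot.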

## Test checklist

- [ ] Run the migration with `manage.py migrate snpdb` (creates `JobsControl`).
- [ ] A node that OOMs (exceeds the `analysis_workers` `RLIMIT_AS`) → node ends `ERROR`, box stays up, `NodeOutOfMemoryException` appears in Rollbar, node is **not** re-run.
- [ ] Kill an `analysis_workers` worker mid-load (node in `LOADING`) → node perma-fails (`ERROR`) + Rollbar message; it is not re-dispatched.
- [ ] Kill a worker while a node is `QUEUED` (not yet started) → node re-leases and completes (self-heal still works).
- [ ] `analysis_workers` dedicated worker boots with the cap; combined/`annotation_workers`/default workers are **not** capped (confirm VEP 25k-record bulk inserts still load).
- [ ] Confirm 8 GB is appropriate — inspect real `analysis_workers` VSZ under load and tune.
- [ ] `jobs_control pause` → no new analysis/annotation work leased or launched; in-flight finishes; `status` shows paused.
- [ ] `jobs_control resume` → dispatchers pick work back up.
- [ ] Reboot a worker host → jobs auto-pause once; `jobs_control resume` clears it; a normal deploy (no reboot) does **not** pause.

## Possible follow-ups (out of scope, not yet built)

- Crash-loop detector (Redis restart counter w/ TTL) tripping the same `JobsControl` pause — covers process crash-loops that don't reboot the box, which `/proc/uptime` misses.
- If `task_acks_late` is ever enabled, add an `is_paused()` guard at the top of `update_node_task` (a redelivered task would otherwise bypass the dispatcher guards).
- Optional lease heartbeat so a legitimately slow (>`LEASE_SECONDS`) load isn't mistaken for a mid-load death.

