Test: handle analysis/annotation worker OOM + crash auto-restart (perma-fail, memory cap, pause brake) #1620

Description

@davmlaw

🤖 Written by Claude

Background

The analysis worker dispatcher auto-restarts terminated node jobs. A node whose query is large enough to crash the worker host (e.g. list(qs.values_list("pk", flat=True)) on 7M variants) would be re-dispatched once the box recovered, crashing it again in a loop. Statement timeouts didn't catch it (the query streams fine; the worker OOMs while materialising the list), and RDS stayed up throughout.

This issue tracks testing of the mitigations added for this. The underlying list(...) bug is already fixed separately.

Changes to test

1. Distinguish worker-died-mid-load from benign worker death (analysis/tasks/analysis_update_tasks.py, lease_ready_nodes)

  • A lapsed lease on a node that committed LOADING/LOADING_CACHE (a worker started executing it and never returned) is perma-failed (ERROR, not re-dispatched) + reported to Rollbar via report_message.
  • A lapsed lease on a QUEUED node (worker died before starting it — deploy/restart) still re-leases as before (self-heal preserved).
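
A minimal sketch of the decision described in the two bullets above, for reference while testing. It is not the code in lease_ready_nodes: the Node dataclass, the lease_expires field and the resolve_lapsed_lease helper are hypothetical stand-ins, and only the status names come from this issue.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum


class NodeStatus(Enum):
    QUEUED = "QUEUED"
    LOADING = "LOADING"
    LOADING_CACHE = "LOADING_CACHE"
    ERROR = "ERROR"


# Statuses a worker commits once it has actually started executing the node.
STARTED_STATUSES = {NodeStatus.LOADING, NodeStatus.LOADING_CACHE}


@dataclass
class Node:  # hypothetical stand-in for the real node model
    pk: int
    status: NodeStatus
    lease_expires: datetime


def resolve_lapsed_lease(node: Node, now: datetime) -> str:
    """Return 'wait', 'perma_fail' or 're_lease' for a leased node."""
    if node.lease_expires > now:
        return "wait"  # lease is still held, leave the node alone
    if node.status in STARTED_STATUSES:
        # A worker started the load and never returned: do not re-dispatch.
        node.status = NodeStatus.ERROR
        return "perma_fail"
    # The worker died before starting (deploy/restart): safe to lease again.
    return "re_lease"
```

In the real dispatcher the 'perma_fail' branch would also be where the Rollbar report_message call happens.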

2. Worker memory cap → catch OOM in Python (variantgrid/celery.py, worker_process_init)

  • RLIMIT_AS set per-queue via CELERY_WORKER_ADDRESS_SPACE_LIMIT_GB (dict), applied only to workers dedicated to capped queues (so VEP bulk inserts on annotation_workers are never throttled). Default {"analysis_workers": 8}.
  • A runaway allocation now raises a catchable MemoryError; update_node_task catches it, perma-fails the node (ERROR, terminal), and raises NodeOutOfMemoryException so the task_failure handler reports it to Rollbar with analysis/node context.
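
The cap-and-catch behaviour above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in variantgrid/celery.py: the helper names are hypothetical, taking the largest limit when a worker serves several capped queues is a guess, and NodeOutOfMemoryException here is a local stand-in for the real class. The resource module is Unix-only.

```python
import resource

GIB = 1024 ** 3

# Mirrors the default quoted above.
CELERY_WORKER_ADDRESS_SPACE_LIMIT_GB = {"analysis_workers": 8}


def address_space_cap_bytes(worker_queues, limits_gb):
    """Return the cap in bytes, or None if the worker must stay uncapped."""
    queues = set(worker_queues)
    # Cap only workers dedicated to capped queues; one uncapped queue
    # (e.g. annotation_workers) means no cap at all.
    if not queues or not queues.issubset(limits_gb):
        return None
    # Assumption: with several capped queues, use the largest limit.
    return int(max(limits_gb[q] for q in queues) * GIB)


def apply_address_space_cap(cap_bytes):
    """Lower the soft RLIMIT_AS so runaway allocations raise MemoryError."""
    _soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    if hard != resource.RLIM_INFINITY:
        cap_bytes = min(cap_bytes, hard)  # soft limit cannot exceed hard
    resource.setrlimit(resource.RLIMIT_AS, (cap_bytes, hard))


class NodeOutOfMemoryException(Exception):
    """Local stand-in for the exception named in this issue."""


def run_node_load(load, mark_error):
    """Run a node load; on MemoryError mark the node ERROR and re-raise."""
    try:
        return load()
    except MemoryError as exc:
        mark_error()  # terminal: the node must not be re-dispatched
        raise NodeOutOfMemoryException("node load hit the memory cap") from exc
```

A worker_process_init handler would call address_space_cap_bytes with the queues that worker consumes and only call apply_address_space_cap when the result is not None.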

3. Global pause brake (snpdb.models.JobsControl singleton, migration 0189_jobscontrol)

  • When paused, lease_ready_nodes, reschedule_stalled_analyses, and annotation dispatch_annotation_runs no-op (in-flight work untouched; resume picks it back up).
  • Management command: manage.py jobs_control {pause|resume|status} (--reason on pause).
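
As a rough picture of what "no-op when paused" means for a dispatcher, here is a sketch using an in-memory stand-in for the JobsControl singleton. The JobsControlState class and the function names are hypothetical; the real model lives in snpdb.models and is backed by the database.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class JobsControlState:  # hypothetical in-memory stand-in for JobsControl
    paused: bool = False
    reason: Optional[str] = None


def pause(state: JobsControlState, reason: Optional[str] = None) -> None:
    state.paused = True
    state.reason = reason


def resume(state: JobsControlState) -> None:
    state.paused = False
    state.reason = None


def dispatch_if_running(state: JobsControlState, dispatch: Callable[[], int]) -> int:
    """Run the dispatcher unless paused; return how many jobs it launched."""
    if state.paused:
        return 0  # no-op: nothing leased or launched, in-flight work untouched
    return dispatch()
```

Because the guard sits in the dispatchers rather than in the tasks, work that is already running finishes normally, which is what the pause item in the checklist expects.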

4. Auto-pause after host reboot (snpdb/signals/jobs_autopause.py, worker_ready)

  • On long-lived hosts, low /proc/uptime ⇒ the box rebooted (a normal deploy leaves uptime high). Pauses once per boot, keyed on /proc/stat btime, so concurrent workers or worker restarts don't re-trip it and an admin resume is not undone by a later worker start in the same boot.
  • Controlled by JOBS_AUTOPAUSE_ON_REBOOT (default on) / JOBS_AUTOPAUSE_ON_REBOOT_UPTIME_SECS (600). Turn off on ephemeral/autoscaled hosts.
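
The reboot check can be sketched as below. The /proc/uptime and /proc/stat formats are standard Linux; everything else is an assumption for illustration: the function names are hypothetical, and handled_btimes is an in-memory set standing in for whatever persistent record the real signal handler uses to remember which boot it already paused for.

```python
def parse_uptime_secs(proc_uptime_text: str) -> float:
    """First field of /proc/uptime is seconds since boot."""
    return float(proc_uptime_text.split()[0])


def parse_btime(proc_stat_text: str) -> int:
    """The 'btime' line of /proc/stat is the boot time as a Unix timestamp."""
    for line in proc_stat_text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == "btime":
            return int(fields[1])
    raise ValueError("no btime line found")


def should_autopause(uptime_secs, btime, handled_btimes, threshold_secs=600):
    """Return True at most once per boot, and only shortly after that boot."""
    if uptime_secs >= threshold_secs:
        return False  # host has been up a while: treat as a normal deploy
    if btime in handled_btimes:
        return False  # this boot already triggered a pause
    handled_btimes.add(btime)
    return True
```

On a real host the inputs would come from reading /proc/uptime and /proc/stat, and the default threshold of 600 matches JOBS_AUTOPAUSE_ON_REBOOT_UPTIME_SECS above.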

Test checklist

  • Run the migration: manage.py migrate snpdb (creates JobsControl).
  • A node that OOMs (exceeds the analysis_workers RLIMIT_AS) → node ends ERROR, box stays up, NodeOutOfMemoryException appears in Rollbar, node is not re-run.
  • Kill an analysis_worker mid-load (LOADING) → node perma-fails (ERROR) + Rollbar message; it is not re-dispatched.
  • Kill a worker while a node is QUEUED (not yet started) → node re-leases and completes (self-heal still works).
  • A dedicated analysis_workers worker boots with the cap; combined, annotation_workers, and default workers are not capped (confirm VEP 25k-record bulk inserts still load).
  • Confirm 8 GB is appropriate — inspect real analysis_workers VSZ under load and tune.
  • jobs_control pause → no new analysis/annotation work leased or launched; in-flight finishes; status shows paused.
  • jobs_control resume → dispatchers pick work back up.
  • Reboot a worker host → jobs auto-pause once; jobs_control resume clears it; a normal deploy (no reboot) does not pause.

Possible follow-ups (out of scope, not yet built)

  • Crash-loop detector (Redis restart counter w/ TTL) tripping the same JobsControl pause — covers process crash-loops that don't reboot the box, which /proc/uptime misses.
  • If task_acks_late is ever enabled, add an is_paused() guard at the top of update_node_task (a redelivered task would otherwise bypass the dispatcher guards).
  • Optional lease heartbeat so a legitimately slow (>LEASE_SECONDS) load isn't mistaken for a mid-load death.
