🤖 Written by Claude
Background
The analysis worker dispatcher auto-restarts terminated node jobs. A node whose query is large enough to crash the worker host (e.g. list(qs.values_list("pk", flat=True)) on 7M variants) would get re-dispatched after the box recovered, repeatedly crashing it. Statement timeouts didn't catch it (the query streams fine; the worker OOMs materialising the list) and RDS stayed up.
This issue tracks testing of the mitigations added for this. The underlying list(...) bug is already fixed separately.
Changes to test
1. Distinguish worker-died-mid-load from benign worker death (analysis/tasks/analysis_update_tasks.py, lease_ready_nodes)
- A lapsed lease on a node that committed LOADING/LOADING_CACHE (a worker started executing it and never returned) is perma-failed (ERROR, not re-dispatched) + reported to Rollbar via report_message.
- A lapsed lease on a QUEUED node (worker died before starting it: deploy/restart) still re-leases as before (self-heal preserved).
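The lapsed-lease decision described above can be sketched roughly as follows. This is a minimal illustration, not the code in lease_ready_nodes: the Node dataclass, the Status enum, and the report callback are hypothetical stand-ins for the real node model and for report_message.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum


class Status(Enum):
    QUEUED = "queued"
    LOADING = "loading"
    LOADING_CACHE = "loading_cache"
    ERROR = "error"


# Statuses that mean a worker had already started executing the node.
STARTED = {Status.LOADING, Status.LOADING_CACHE}


@dataclass
class Node:  # hypothetical stand-in for the real node model
    pk: int
    status: Status
    lease_expires: datetime


def handle_lapsed_lease(node: Node, now: datetime, report) -> str:
    """Return the action taken for one node: 'skip', 'fail' or 're-lease'."""
    if node.lease_expires > now:
        return "skip"  # lease still held, the worker may be alive
    if node.status in STARTED:
        # Worker died mid-load: re-running would likely kill the next worker too.
        node.status = Status.ERROR
        report(f"Node {node.pk}: lease lapsed mid-load, perma-failed")
        return "fail"
    if node.status is Status.QUEUED:
        # Worker died before starting the node: safe to hand it out again.
        return "re-lease"
    return "skip"  # anything else (e.g. already ERROR) is left alone
```

The point of the split is that only the QUEUED branch keeps the self-healing behaviour; a node that was already loading is treated as the likely cause of the crash.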
2. Worker memory cap → catch OOM in Python (variantgrid/celery.py, worker_process_init)
- RLIMIT_AS set per-queue via CELERY_WORKER_ADDRESS_SPACE_LIMIT_GB (dict), applied only to workers dedicated to capped queues (so VEP bulk inserts on annotation_workers are never throttled). Default {"analysis_workers": 8}.
- A runaway allocation now raises a catchable MemoryError; update_node_task catches it, perma-fails the node (ERROR, terminal), and raises NodeOutOfMemoryException so the task_failure handler reports it to Rollbar with analysis/node context.
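A rough sketch of the cap and the catch, using only the standard-library resource module (Unix only). It is not the code in variantgrid/celery.py: limit_for_worker, apply_address_space_limit, run_node_load, and the mark_error callback are hypothetical names, the exception class is a local stand-in, and treating the GB setting as GiB is an assumption.

```python
import resource  # Unix-only standard library module

GIB = 1024 ** 3  # assumption: the *_GB setting is interpreted as GiB

# Mirrors the documented default: cap in GB per dedicated queue.
ADDRESS_SPACE_LIMIT_GB = {"analysis_workers": 8}


class NodeOutOfMemoryException(Exception):
    """Local stand-in for the project's exception of the same name."""


def limit_for_worker(queues, limits_gb=ADDRESS_SPACE_LIMIT_GB):
    """Return the cap in bytes, or None if the worker should stay uncapped.

    A worker is capped only when every queue it consumes is a capped queue,
    so a combined worker that also serves annotation_workers is left alone.
    """
    queues = list(queues)
    if not queues or any(q not in limits_gb for q in queues):
        return None
    # If several capped queues share a worker, use the most generous cap.
    return max(limits_gb[q] for q in queues) * GIB


def apply_address_space_limit(limit_bytes, setrlimit=resource.setrlimit):
    """Lower the soft RLIMIT_AS; the hard limit is left unchanged."""
    _soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    if hard != resource.RLIM_INFINITY:
        limit_bytes = min(limit_bytes, hard)  # soft may not exceed hard
    setrlimit(resource.RLIMIT_AS, (limit_bytes, hard))
    return limit_bytes


def run_node_load(load, mark_error):
    """Run load(); on MemoryError, mark the node failed and re-raise typed."""
    try:
        return load()
    except MemoryError as exc:
        mark_error()  # hypothetical callback that sets the node to ERROR
        raise NodeOutOfMemoryException(
            "node load exceeded the address-space cap"
        ) from exc
```

RLIMIT_AS limits virtual address space rather than resident memory, so the cap has to sit above a worker's normal VSZ, which can be much larger than its RSS.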
3. Global pause brake (snpdb.models.JobsControl singleton, migration 0189_jobscontrol)
- When paused, lease_ready_nodes, reschedule_stalled_analyses, and annotation dispatch_annotation_runs no-op (in-flight work untouched; resume picks it back up).
- Management command: manage.py jobs_control {pause|resume|status} (--reason on pause).
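The brake can be pictured as a single flag that every dispatcher checks before doing anything. The sketch below is a plain-Python illustration only: PauseState is a hypothetical in-memory stand-in for the JobsControl singleton row, and skip_when_paused is an invented helper, not something the source names.

```python
import functools
from dataclasses import dataclass


@dataclass
class PauseState:  # hypothetical in-memory stand-in for the JobsControl row
    paused: bool = False
    reason: str = ""

    def pause(self, reason: str = "") -> None:
        self.paused, self.reason = True, reason

    def resume(self) -> None:
        self.paused, self.reason = False, ""

    def status(self) -> str:
        if self.paused:
            return f"paused ({self.reason})" if self.reason else "paused"
        return "running"


def skip_when_paused(state: PauseState):
    """Decorator: turn a dispatcher into a no-op while the brake is on."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if state.paused:
                return None  # nothing leased or launched; in-flight work untouched
            return func(*args, **kwargs)
        return wrapper
    return decorator


state = PauseState()


@skip_when_paused(state)
def lease_ready_nodes_sketch(ready):
    """Toy dispatcher: 'leases' every ready node id it is given."""
    return list(ready)
```

Because the guard sits only in the dispatchers, pausing stops new work from being handed out but does not interrupt tasks that are already running.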
4. Auto-pause after host reboot (snpdb/signals/jobs_autopause.py, worker_ready)
- On long-lived hosts, low /proc/uptime ⇒ the box rebooted (a normal deploy leaves uptime high). Pauses once per boot, keyed on /proc/stat btime so concurrent workers / restarts don't re-trip it and it survives an admin resume.
- Controlled by JOBS_AUTOPAUSE_ON_REBOOT (default on) / JOBS_AUTOPAUSE_ON_REBOOT_UPTIME_SECS (600). Turn off on ephemeral/autoscaled hosts.
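The reboot check reduces to two reads from /proc plus a "have we already handled this boot?" lookup. A minimal sketch, assuming the handled boot times live in a plain set; the real code would need shared storage (such as the database) for the once-per-boot guarantee to hold across worker processes. All function names here are hypothetical.

```python
def parse_uptime_seconds(proc_uptime_text: str) -> float:
    """The first field of /proc/uptime is seconds since boot."""
    return float(proc_uptime_text.split()[0])


def parse_btime(proc_stat_text: str) -> int:
    """The 'btime' line of /proc/stat is the boot time as a Unix timestamp."""
    for line in proc_stat_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "btime":
            return int(fields[1])
    raise ValueError("no btime line found")


def should_autopause(uptime_secs, btime, handled_btimes,
                     threshold_secs=600, enabled=True):
    """Return True exactly once per boot, and only shortly after a reboot.

    handled_btimes is a set of boot timestamps already acted on; an
    in-memory set is an assumption made for illustration.
    """
    if not enabled or uptime_secs >= threshold_secs:
        return False  # switched off, or a normal deploy on a long-lived host
    if btime in handled_btimes:
        return False  # this boot already triggered (and maybe got resumed)
    handled_btimes.add(btime)
    return True
```

Keying on btime rather than on "is it currently paused" is what lets an admin resume stick: a worker restarting five minutes after the reboot sees the same boot timestamp and does nothing.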
Test checklist
- migrate snpdb (creates JobsControl).
- […] analysis_workers RLIMIT_AS) → node ends ERROR, box stays up, NodeOutOfMemoryException appears in Rollbar, node is not re-run.
- […] analysis_worker mid-load (LOADING) → node perma-fails (ERROR) + Rollbar message; it is not re-dispatched.
- […] QUEUED (not yet started) → node re-leases and completes (self-heal still works).
- analysis_workers dedicated worker boots with the cap; combined/annotation_workers/default workers are not capped (confirm VEP 25k-record bulk inserts still load).
- […] analysis_workers VSZ under load and tune.
- jobs_control pause → no new analysis/annotation work leased or launched; in-flight finishes; status shows paused.
- jobs_control resume → dispatchers pick work back up.
- […] jobs_control resume clears it; a normal deploy (no reboot) does not pause.
Possible follow-ups (out of scope, not yet built)
- Crash-loop detector (Redis restart counter w/ TTL) tripping the same JobsControl pause; covers process crash-loops that don't reboot the box, which /proc/uptime misses.
- If task_acks_late is ever enabled, add an is_paused() guard at the top of update_node_task (a redelivered task would otherwise bypass the dispatcher guards).
- Optional lease heartbeat so a legitimately slow (> LEASE_SECONDS) load isn't mistaken for a mid-load death.