fix: guard supervisor stale-job reconciliation#5
Conversation
Change-Id: I12283d663097ff6a6b0afb6447511bbab73dd404
Reviewer's GuideAdds per-supervisor instance ownership metadata to job records and updates supervisor status reconciliation so that running jobs owned by another live supervisor process are not incorrectly marked as failed, including persistence in the session store and tests for the new behaviour. Sequence diagram for updated supervisor status reconciliationsequenceDiagram
participant Supervisor
participant SessionStore
participant Helper as _owned_by_foreign_live_supervisor
participant PidCheck as _pid_exists
participant OS
Supervisor->>SessionStore: get_job(job_id)
SessionStore-->>Supervisor: JobRecord(record)
alt record is None
Supervisor-->>Supervisor: return None
else record.status != running
Supervisor-->>Supervisor: return _public_record(record)
else record.status == running
Supervisor->>Helper: _owned_by_foreign_live_supervisor(record, _instance_id)
Helper->>PidCheck: _pid_exists(owner.pid)
PidCheck->>OS: os.kill(pid, 0)
OS-->>PidCheck: result
PidCheck-->>Helper: bool
Helper-->>Supervisor: bool
alt owned by foreign live supervisor
Supervisor-->>Supervisor: return _public_record(record)
else not owned or owner dead
Supervisor->>SessionStore: finalize_job(job_id, status=failed, error="reconciled by supervisor")
SessionStore-->>Supervisor: JobRecord(finalised)
Supervisor-->>Supervisor: return _public_record(finalised)
end
end
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
📝 WalkthroughWalkthroughJobRecord storage is extended to accept and persist ChangesCross-instance supervisor job ownership and reconciliation
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Hey - I've found 1 issue, and left some high level feedback:
- In
Supervisor.start, you callos.getpid()twice when persisting the job record; consider capturing it once in a local variable and reusing it to avoid tiny inconsistencies and make the intent clearer. - The liveness check in
_owned_by_foreign_live_supervisorrelies solely on PID and instance ID; you may want to consider mitigating PID reuse issues (e.g., by also recording a start timestamp or monotonic counter) so that very long-lived records are not incorrectly treated as owned by a new, unrelated process that happens to reuse the same PID.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `Supervisor.start`, you call `os.getpid()` twice when persisting the job record; consider capturing it once in a local variable and reusing it to avoid tiny inconsistencies and make the intent clearer.
- The liveness check in `_owned_by_foreign_live_supervisor` relies solely on PID and instance ID; you may want to consider mitigating PID reuse issues (e.g., by also recording a start timestamp or monotonic counter) so that very long-lived records are not incorrectly treated as owned by a new, unrelated process that happens to reuse the same PID.
## Individual Comments
### Comment 1
<location path="src/agy_mcp/supervisor.py" line_range="350-354" />
<code_context>
cwd=self._response_cwd(effective_request.cwd),
request=_serialise_request(effective_request, self.safety),
backend=backend_name,
+ pid=os.getpid(),
+ extra={
+ "supervisor": {
+ "pid": os.getpid(),
+ "instance_id": self._instance_id,
+ }
+ },
</code_context>
<issue_to_address>
**question (bug_risk):** Clarify whether the stored `pid` is meant to be the worker/job pid or the supervisor process pid.
In `start()`, both the `pid` passed to `create_job` and `extra["supervisor"]["pid"]` use `os.getpid()`, so both currently store the supervisor’s pid. If any code later treats `record.pid` as the job/worker subprocess pid (e.g. for monitoring or cleanup), this mismatch could cause subtle bugs. Please either rename the field to make ownership explicit (e.g. `supervisor_pid`) or ensure `pid` consistently refers to the same process type across the codebase.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 52c6eb9114
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if owner.get("instance_id") == current_instance_id: | ||
| return False | ||
| pid = owner.get("pid") | ||
| return isinstance(pid, int) and _pid_exists(pid) |
There was a problem hiding this comment.
Guard against PID reuse before trusting foreign owners
When a supervisor exits before finalizing a job, the OS can later reuse its PID for an unrelated process; with this check, a different supervisor will see os.kill(pid, 0) succeed and keep the stale record in running forever instead of reconciling it. This affects stores that are polled after a supervisor crash/restart on a busy host, so the owner metadata needs an additional identity check such as process start time/boot id (or another non-reusable token) before treating the recorded owner as live.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Pull request overview
This PR adds supervisor ownership metadata to persisted job records so stale-job reconciliation does not incorrectly fail jobs owned by another live supervisor.
Changes:
- Persists supervisor PID and instance ID when creating jobs.
- Skips reconciliation for running jobs owned by a foreign live supervisor.
- Adds tests for owner-aware reconciliation and session-store metadata round-tripping.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
src/agy_mcp/supervisor.py |
Adds supervisor instance IDs, persists owner metadata, and guards reconciliation with a PID liveness check. |
src/agy_mcp/session_store.py |
Extends create_job to persist optional pid and extra metadata. |
tests/test_supervisor.py |
Adds tests for live/dead foreign supervisor ownership behavior. |
tests/test_session_store.py |
Verifies pid and supervisor metadata persist through create/get round trip. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| return isinstance(pid, int) and _pid_exists(pid) | ||
|
|
||
|
|
Change-Id: Id1eb24c1664d0cab79965179dab1cbbbbea010de
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: eb4477543b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| return None | ||
| try: | ||
| result = subprocess.run( | ||
| ["ps", "-o", "lstart=", "-p", str(pid)], |
There was a problem hiding this comment.
Use a timezone-stable process signature
When two supervisors share a store but run with different TZ environments, ps -o lstart= formats the same process start time in the caller's local timezone; I checked locally that TZ=UTC ps -o lstart= -p $$ and TZ=America/Los_Angeles ... produce different hours for the same PID. In that case a second supervisor polling a live foreign job compares a different string, treats the owner PID as reused, and finalizes the still-running job as failed.
Useful? React with 👍 / 👎.
Change-Id: Iaf6394ae5ea5a148f13cdb3059112a418ff8c01d
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7c854b3ce8
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| return None | ||
| try: | ||
| result = subprocess.run( | ||
| ["ps", "-o", "lstart=", "-p", str(pid)], |
There was a problem hiding this comment.
Use a higher-resolution process identity
When a stale supervisor PID is recycled within the same second, this signature still matches because ps -o lstart contains only whole-second precision (I checked ps -o lstart= -p $$, which prints a value like Wed May 27 09:14:40 2026). The new evidence relative to the earlier PID-reuse comment is that this commit’s identity token is that second-resolution lstart string, so a fast PID reuse in a busy container/host can still make _pid_matches_start_signature return true and leave the stale job running indefinitely; use a non-reusable or higher-resolution identity such as /proc/<pid>/stat start ticks plus boot id where available.
Useful? React with 👍 / 👎.
Change-Id: I6827605a7d0b0c779999ca63f7711dfb2d44061b
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 73161a7adf
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if pid <= 0: | ||
| return False | ||
| try: | ||
| os.kill(pid, 0) |
There was a problem hiding this comment.
Stop using os.kill(pid, 0) on Windows
When this package runs on Windows (which the docs advertise as supported), a second supervisor polling a foreign running job without a usable start signature reaches this liveness probe; Python’s Windows os.kill implementation treats any signal other than CTRL_C/CTRL_BREAK as TerminateProcess, so os.kill(pid, 0) can terminate the live owner supervisor instead of just checking whether it exists. Use a non-destructive Windows liveness check such as opening the process and querying its exit code (or otherwise gate this POSIX probe to POSIX platforms).
Useful? React with 👍 / 👎.
Change-Id: If2e16fbde1cabae8e05bfdbad6701a54d79ec7fc
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7d57758b63
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| return False | ||
| try: | ||
| kernel32 = ctypes.windll.kernel32 | ||
| handle = kernel32.OpenProcess(0x1000, False, pid) |
There was a problem hiding this comment.
Preserve HANDLE width in Windows PID probe
On 64-bit Windows, ctypes.windll defaults function return types to 32-bit c_int, but OpenProcess returns a pointer-sized HANDLE. When a valid handle has high bits set, this truncates the value passed to GetExitCodeProcess/CloseHandle, so _windows_pid_exists() can report a live foreign supervisor as missing and status() will finalize that still-running job as failed; set the OpenProcess restype/argtypes with ctypes.wintypes.HANDLE/DWORD before using the handle.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/agy_mcp/supervisor.py (1)
470-485:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftAvoid blocking foreign-owner probes while holding the supervisor lock.
Line [470] holds
self._lock, and Line [484] can trigger process probes that may block (includingsubprocess.run(..., timeout=1)). This can stallstart()/cancel()and other status calls behind a single global lock under load.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/agy_mcp/supervisor.py` around lines 470 - 485, The supervisor currently holds self._lock while calling a potentially blocking foreign-owner probe via _owned_by_foreign_live_supervisor(fresh, self._instance_id), which can stall other operations; fix by minimizing lock scope: inside the with self._lock block, read and cache handle, fresh = self.store.get_job(job_id) and any simple flags (e.g., fresh.status and an indicator that ownership must be probed) then release the lock and only then call the blocking probe function _owned_by_foreign_live_supervisor(fresh, self._instance_id); after the probe re-acquire the lock if you need to make a final decision or return a consistent _public_record(fresh) (or re-read via store.get_job if necessary) so no long-running blocking call happens while holding self._lock.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/agy_mcp/supervisor.py`:
- Around line 829-849: _windows_pid_exists currently calls
kernel32.OpenProcess/GetExitCodeProcess/CloseHandle without declaring ctypes
signatures, which can truncate HANDLEs on 64-bit Windows; fix by importing
ctypes.wintypes and explicitly setting kernel32.OpenProcess.argtypes =
(wintypes.DWORD, wintypes.BOOL, wintypes.DWORD) and kernel32.OpenProcess.restype
= wintypes.HANDLE, kernel32.GetExitCodeProcess.argtypes = (wintypes.HANDLE,
ctypes.POINTER(wintypes.DWORD)) and restype = wintypes.BOOL, and
kernel32.CloseHandle.argtypes = (wintypes.HANDLE,) and restype = wintypes.BOOL
in the _windows_pid_exists function before calling these APIs (also treat a
returned handle of 0/NULL as failure and use a wintypes.DWORD for exit_code).
---
Outside diff comments:
In `@src/agy_mcp/supervisor.py`:
- Around line 470-485: The supervisor currently holds self._lock while calling a
potentially blocking foreign-owner probe via
_owned_by_foreign_live_supervisor(fresh, self._instance_id), which can stall
other operations; fix by minimizing lock scope: inside the with self._lock
block, read and cache handle, fresh = self.store.get_job(job_id) and any simple
flags (e.g., fresh.status and an indicator that ownership must be probed) then
release the lock and only then call the blocking probe function
_owned_by_foreign_live_supervisor(fresh, self._instance_id); after the probe
re-acquire the lock if you need to make a final decision or return a consistent
_public_record(fresh) (or re-read via store.get_job if necessary) so no
long-running blocking call happens while holding self._lock.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 35dd33d0-caed-4dcf-8bcc-83c69d6f01ca
📒 Files selected for processing (4)
src/agy_mcp/session_store.pysrc/agy_mcp/supervisor.pytests/test_session_store.pytests/test_supervisor.py
| def _windows_pid_exists(pid: int) -> bool: | ||
| """Return whether ``pid`` exists on Windows without signalling it.""" | ||
|
|
||
| try: | ||
| import ctypes | ||
| except ImportError: | ||
| return False | ||
| try: | ||
| kernel32 = ctypes.windll.kernel32 | ||
| handle = kernel32.OpenProcess(0x1000, False, pid) | ||
| if not handle: | ||
| return False | ||
| try: | ||
| exit_code = ctypes.c_ulong() | ||
| if not kernel32.GetExitCodeProcess(handle, ctypes.byref(exit_code)): | ||
| return False | ||
| return exit_code.value == 259 | ||
| finally: | ||
| kernel32.CloseHandle(handle) | ||
| except (AttributeError, OSError): | ||
| return False |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify current Windows PID probe interop setup.
# Expected: WinAPI calls appear without explicit argtypes/restype declarations.
rg -n -C4 'def _windows_pid_exists|OpenProcess|GetExitCodeProcess|CloseHandle|ctypes\.windll|ctypes\.WinDLL|argtypes|restype' src/agy_mcp/supervisor.pyRepository: Boulea7/agy-mcp
Length of output: 882
🏁 Script executed:
#!/bin/bash
rg -n 'argtypes|restype|ctypes\.WinDLL|wintypes' src/agy_mcp/supervisor.pyRepository: Boulea7/agy-mcp
Length of output: 41
Define explicit ctypes WinAPI signatures to avoid HANDLE truncation on Windows.
_windows_pid_exists calls OpenProcess / GetExitCodeProcess / CloseHandle via ctypes.windll.kernel32 without setting argtypes/restype. On 64-bit Windows, ctypes’ default restype/argument conversions can truncate HANDLE values, causing false negatives or invalid handle use.
Suggested fix
def _windows_pid_exists(pid: int) -> bool:
"""Return whether ``pid`` exists on Windows without signalling it."""
try:
- import ctypes
+ import ctypes
+ from ctypes import wintypes
except ImportError:
return False
try:
- kernel32 = ctypes.windll.kernel32
- handle = kernel32.OpenProcess(0x1000, False, pid)
+ kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
+ kernel32.OpenProcess.argtypes = [wintypes.DWORD, wintypes.BOOL, wintypes.DWORD]
+ kernel32.OpenProcess.restype = wintypes.HANDLE
+ kernel32.GetExitCodeProcess.argtypes = [wintypes.HANDLE, ctypes.POINTER(wintypes.DWORD)]
+ kernel32.GetExitCodeProcess.restype = wintypes.BOOL
+ kernel32.CloseHandle.argtypes = [wintypes.HANDLE]
+ kernel32.CloseHandle.restype = wintypes.BOOL
+
+ handle = kernel32.OpenProcess(0x1000, False, pid)
if not handle:
return False
try:
- exit_code = ctypes.c_ulong()
+ exit_code = wintypes.DWORD()
if not kernel32.GetExitCodeProcess(handle, ctypes.byref(exit_code)):
return False
return exit_code.value == 259
finally:
kernel32.CloseHandle(handle)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def _windows_pid_exists(pid: int) -> bool: | |
| """Return whether ``pid`` exists on Windows without signalling it.""" | |
| try: | |
| import ctypes | |
| except ImportError: | |
| return False | |
| try: | |
| kernel32 = ctypes.windll.kernel32 | |
| handle = kernel32.OpenProcess(0x1000, False, pid) | |
| if not handle: | |
| return False | |
| try: | |
| exit_code = ctypes.c_ulong() | |
| if not kernel32.GetExitCodeProcess(handle, ctypes.byref(exit_code)): | |
| return False | |
| return exit_code.value == 259 | |
| finally: | |
| kernel32.CloseHandle(handle) | |
| except (AttributeError, OSError): | |
| return False | |
| def _windows_pid_exists(pid: int) -> bool: | |
| """Return whether ``pid`` exists on Windows without signalling it.""" | |
| try: | |
| import ctypes | |
| from ctypes import wintypes | |
| except ImportError: | |
| return False | |
| try: | |
| kernel32 = ctypes.WinDLL("kernel32", use_last_error=True) | |
| kernel32.OpenProcess.argtypes = [wintypes.DWORD, wintypes.BOOL, wintypes.DWORD] | |
| kernel32.OpenProcess.restype = wintypes.HANDLE | |
| kernel32.GetExitCodeProcess.argtypes = [wintypes.HANDLE, ctypes.POINTER(wintypes.DWORD)] | |
| kernel32.GetExitCodeProcess.restype = wintypes.BOOL | |
| kernel32.CloseHandle.argtypes = [wintypes.HANDLE] | |
| kernel32.CloseHandle.restype = wintypes.BOOL | |
| handle = kernel32.OpenProcess(0x1000, False, pid) | |
| if not handle: | |
| return False | |
| try: | |
| exit_code = wintypes.DWORD() | |
| if not kernel32.GetExitCodeProcess(handle, ctypes.byref(exit_code)): | |
| return False | |
| return exit_code.value == 259 | |
| finally: | |
| kernel32.CloseHandle(handle) | |
| except (AttributeError, OSError): | |
| return False |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/agy_mcp/supervisor.py` around lines 829 - 849, _windows_pid_exists
currently calls kernel32.OpenProcess/GetExitCodeProcess/CloseHandle without
declaring ctypes signatures, which can truncate HANDLEs on 64-bit Windows; fix
by importing ctypes.wintypes and explicitly setting
kernel32.OpenProcess.argtypes = (wintypes.DWORD, wintypes.BOOL, wintypes.DWORD)
and kernel32.OpenProcess.restype = wintypes.HANDLE,
kernel32.GetExitCodeProcess.argtypes = (wintypes.HANDLE,
ctypes.POINTER(wintypes.DWORD)) and restype = wintypes.BOOL, and
kernel32.CloseHandle.argtypes = (wintypes.HANDLE,) and restype = wintypes.BOOL
in the _windows_pid_exists function before calling these APIs (also treat a
returned handle of 0/NULL as failure and use a wintypes.DWORD for exit_code).
There was a problem hiding this comment.
1 issue found across 2 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="src/agy_mcp/supervisor.py">
<violation number="1" location="src/agy_mcp/supervisor.py:839">
P1: `_windows_pid_exists` conflates permission failures with non-existent PIDs, which can misclassify live processes as dead during stale-job reconciliation on Windows.</violation>
</file>
Tip: Review your code locally with the cubic CLI to iterate faster.
Re-trigger cubic
| try: | ||
| kernel32 = ctypes.windll.kernel32 | ||
| handle = kernel32.OpenProcess(0x1000, False, pid) | ||
| if not handle: |
There was a problem hiding this comment.
P1: _windows_pid_exists conflates permission failures with non-existent PIDs, which can misclassify live processes as dead during stale-job reconciliation on Windows.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/agy_mcp/supervisor.py, line 839:
<comment>`_windows_pid_exists` conflates permission failures with non-existent PIDs, which can misclassify live processes as dead during stale-job reconciliation on Windows.</comment>
<file context>
@@ -818,6 +826,29 @@ def _pid_exists(pid: int) -> bool:
+ try:
+ kernel32 = ctypes.windll.kernel32
+ handle = kernel32.OpenProcess(0x1000, False, pid)
+ if not handle:
+ return False
+ try:
</file context>
Summary
Validation
Summary by Sourcery
Guard supervisor stale-job reconciliation by tracking job ownership per supervisor instance and respecting live foreign owners.
New Features:
Bug Fixes:
Enhancements:
Tests:
Summary by cubic
Prevents the supervisor from marking another live supervisor’s running job as failed. We now verify ownership with a high‑resolution, timezone-stable process signature and respect foreign live owners.
pid,extra.supervisor.instance_id,extra.supervisor.process_start_signature); extendSessionStore.create_jobto acceptpidandextra./procboot-id + start ticks; fallback tops lstartwithLC_ALL=C,TZ=UTC).Supervisor.statusrespects foreign live owners (verifies PID liveness and signature; reconciles only when dead or mismatched) and uses a non-destructive Windows PID probe.Written for commit 7d57758. Summary will update on new commits. Review in cubic
Summary by CodeRabbit
Release Notes
New Features
Tests