fix(sdlc): provision coord SSOT tree so R2 event-log + R3 escape grant go LIVE#3822
Conversation
…t go LIVE The coordination tree defaulted to /var/lib/hapax/coord — root-owned and unprovisionable by hapax (uid 1000). path.parent.mkdir raised PermissionError (swallowed), so the R2 coord_event_log SSOT never materialized there, the R3 daemon-independent escape grant was INERT (escape_grant_allows always returned 1: no grant dir/key), and coord-grant-mint could not create the dir/key — leaving only the deprecated HAPAX_*_OFF off-switch. Fix (no root, per parent spec remediation #1): make ~/.cache/hapax/coord the canonical default — user-writable, already where the live ledger sits, still one fixed location outside every worktree (NEW-4). Centralized in coord_event_log.py (coord_base_dir / default_grant_dir / default_grant_key) so the Python minter and the bash shim resolve identically. Precedence per surface: explicit grant override -> $HAPAX_COORD_DIR -> $XDG_CACHE_HOME -> ~/.cache. - shared/coord_event_log.py: coord_base_dir resolver + grant helpers; default_event_log() now dynamic; provision_coord_tree() — a daemon-independent provisioner that idempotently materializes {base, spool/, grants/} + the 0600 grant-key and proves the base writable, raising LOUDLY (no silent swallow) when it cannot. Satisfies the fail-loud AC in-scope, without editing the out-of-scope daemon swallow. - scripts/coord-boot-reconcile --provision: boot entry (provision then reconcile); exit 2 on an unwritable SSOT. - systemd/units/hapax-coord-provision.service: oneshot, After=hapax-secrets, `# Hapax-Auto-Enable: true` so the deploy (#3817) enables it on merge. - scripts/coord-grant-mint + hooks/scripts/escape-grant.sh: consume the shared cache default (no more /var/lib). - shared/governance/coord_capabilities.py: load_or_create_key centralized. Live-verified on host: tree + 0600 grant-key provisioned at ~/.cache/hapax/coord; a minted cc-task-gate grant flips escape_grant_allows 1->0 (block->allow) with no env override; the oneshot is installed + enabled. ruff + pyrefly(0 errors) + 348 coord/capability/hook/importer tests green. Task: reform-improve-coord-ssot-provisioning-20260601 AuthorityCase: CASE-SDLC-REFORM-001 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
📝 WalkthroughWalkthroughThis PR migrates Hapax coordination storage from hardcoded ChangesCoordination Provisioning and Directory Resolution
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2e49779d48
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| base = coord_base_dir() | ||
| return CoordEventLog( | ||
| db_path=base / "ledger.db", | ||
| jsonl_path=base / "ledger.jsonl", |
There was a problem hiding this comment.
Use the shared coord default for dispatch launches
With the normal mutable dispatch path, scripts/hapax-methodology-dispatch still constructs its injected CoordEventLog from HAPAX_COORD_LEDGER_DB/HAPAX_COORD_JSONL_MIRROR/HAPAX_COORD_SPOOL_DIR with /var/lib/hapax/coord fallbacks (lines 839-852) and passes it into run_atomic_dispatch_launch (line 1384). Since this commit only provisions ~/.cache/hapax/coord, those dispatch events continue to target the old root-owned tree unless three legacy env vars are set, so the core R2 event-log path remains inert in exactly the default environment this change is meant to fix.
Useful? React with 👍 / 👎.
| def default_grant_dir() -> Path: | ||
| """Escape-grant directory: ``HAPAX_COORD_GRANT_DIR`` else ``<base>/grants``.""" | ||
| explicit = os.environ.get(GRANT_DIR_ENV, "").strip() | ||
| return Path(explicit) if explicit else coord_base_dir() / "grants" | ||
|
|
||
|
|
||
| def default_grant_key() -> Path: | ||
| """Escape signing key: ``HAPAX_COORD_GRANT_KEY`` else ``<base>/grant-key``.""" | ||
| explicit = os.environ.get(GRANT_KEY_ENV, "").strip() | ||
| return Path(explicit) if explicit else coord_base_dir() / "grant-key" |
There was a problem hiding this comment.
Keep the retro-grant watcher on the new grant store
When an emergency HAPAX_*_OFF obligation is recorded and the operator follows the updated coord-grant-mint defaults, the grant now lands under ~/.cache/hapax/coord/grants; however scripts/coord-retro-grant-watch still falls back to /var/lib/hapax/coord/grants and /var/lib/hapax/coord/grant-key when no per-script env is set (lines 145-146). In that default setup the watcher never sees the covering grant and escalates an already-fulfilled obligation, so this shared default needs to be used there too.
Useful? React with 👍 / 👎.
| for directory in (base, spool, gdir): | ||
| existed = directory.is_dir() | ||
| try: | ||
| directory.mkdir(parents=True, exist_ok=True) | ||
| except OSError as exc: |
There was a problem hiding this comment.
Materialize the ledger during provisioning
On a fresh host with no spooled events, coord-boot-reconcile --provision only creates the directories and grant key; boot_reconcile() then replays an absent SQLite file and returns without creating either ledger.db or ledger.jsonl. That leaves the post-merge/boot oneshot green while the promised SSOT artifacts are still absent until some later append, so provision should initialize the SQLite schema and mirror file as part of the same boot path.
Useful? React with 👍 / 👎.
| key_existed = gkey.exists() | ||
| try: | ||
| load_or_create_key(gkey) |
There was a problem hiding this comment.
Reject or repair an existing permissive grant key
If grant-key already exists with permissive bits (for example from a manual recovery or a previous bad provision), this path just reads it and reports provisioning success; it never checks or fixes the mode. That leaves the escape-grant signing key world/group-readable while the new boot provisioner claims the key is 0600, so validate the existing file's permissions or chmod it before returning success.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@shared/coord_event_log.py`:
- Around line 664-666: The grant paths should be derived from the resolved base
(the local variable base) when grant_dir or grant_key are not provided;
currently gdir and gkey call default_grant_dir() / default_grant_key() which may
use the process environment/coord_base_dir() instead of the caller-provided
base_dir. Change the logic so that after resolving base = Path(base_dir) if
base_dir is not None else coord_base_dir(), you compute gdir = Path(grant_dir)
if grant_dir is not None else default_grant_dir(base) (or otherwise construct
the grant path relative to base) and likewise gkey = Path(grant_key) if
grant_key is not None else default_grant_key(base); if
default_grant_dir/default_grant_key do not accept a base param, update them or
build the grant paths by joining base with the relative default grant filenames
so the grant artifacts inhabit the same base tree.
In `@shared/governance/coord_capabilities.py`:
- Around line 319-329: The code has a TOCTOU race: after path.exists() another
process may create the file so os.open(..., O_EXCL) raises FileExistsError;
change the block around the os.open/os.write to catch FileExistsError from
os.open (and any subsequent failure to create) and in that case read and return
the file contents (path.read_bytes()) instead of propagating the exception;
ensure you only close the file descriptor if os.open succeeded (i.e., open in a
try: fd = os.open(...) except FileExistsError: return path.read_bytes() then
proceed with writing and finally os.close(fd)).
In `@tests/shared/test_coord_event_log.py`:
- Around line 44-64: The test relies on constants DEFAULT_LEDGER_DB,
DEFAULT_JSONL_MIRROR, and DEFAULT_SPOOL_DIR that are evaluated at import time in
shared.coord_event_log, so make the test hermetic by setting/clearing relevant
env vars (e.g. HAPAX_COORD_DIR and XDG_CACHE_HOME) to a controlled temporary
path, then reload the shared.coord_event_log module (via importlib.reload) so
DEFAULT_* are recomputed from the sanitized environment before performing the
assertions; alternatively call the module's dynamic resolver function if
available to obtain fresh paths instead of using the already-imported constants.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 31bfcc38-ac01-4bdd-a3fc-6a4e6de3534d
📒 Files selected for processing (9)
hooks/scripts/escape-grant.shscripts/coord-boot-reconcilescripts/coord-grant-mintshared/coord_event_log.pyshared/governance/coord_capabilities.pysystemd/units/hapax-coord-provision.servicetests/scripts/test_coord_boot_reconcile.pytests/shared/test_coord_event_log.pytests/systemd/test_coord_provision_unit.py
| base = Path(base_dir) if base_dir is not None else coord_base_dir() | ||
| gdir = Path(grant_dir) if grant_dir is not None else default_grant_dir() | ||
| gkey = Path(grant_key) if grant_key is not None else default_grant_key() |
There was a problem hiding this comment.
Derive grant defaults from the explicit base_dir.
If a caller passes base_dir but leaves grant_dir/grant_key unset, Lines 665-666 still resolve the grant paths from the process environment instead of the overridden base. That can provision the ledger under one tree and the escape-grant artifacts under another.
Suggested fix
base = Path(base_dir) if base_dir is not None else coord_base_dir()
- gdir = Path(grant_dir) if grant_dir is not None else default_grant_dir()
- gkey = Path(grant_key) if grant_key is not None else default_grant_key()
+ if grant_dir is not None:
+ gdir = Path(grant_dir)
+ else:
+ explicit_grant_dir = os.environ.get(GRANT_DIR_ENV, "").strip()
+ gdir = Path(explicit_grant_dir) if explicit_grant_dir else base / "grants"
+
+ if grant_key is not None:
+ gkey = Path(grant_key)
+ else:
+ explicit_grant_key = os.environ.get(GRANT_KEY_ENV, "").strip()
+ gkey = Path(explicit_grant_key) if explicit_grant_key else base / "grant-key"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@shared/coord_event_log.py` around lines 664 - 666, The grant paths should be
derived from the resolved base (the local variable base) when grant_dir or
grant_key are not provided; currently gdir and gkey call default_grant_dir() /
default_grant_key() which may use the process environment/coord_base_dir()
instead of the caller-provided base_dir. Change the logic so that after
resolving base = Path(base_dir) if base_dir is not None else coord_base_dir(),
you compute gdir = Path(grant_dir) if grant_dir is not None else
default_grant_dir(base) (or otherwise construct the grant path relative to base)
and likewise gkey = Path(grant_key) if grant_key is not None else
default_grant_key(base); if default_grant_dir/default_grant_key do not accept a
base param, update them or build the grant paths by joining base with the
relative default grant filenames so the grant artifacts inhabit the same base
tree.
| path = Path(path) | ||
| if path.exists(): | ||
| return path.read_bytes() | ||
| path.parent.mkdir(parents=True, exist_ok=True) | ||
| key = secrets.token_bytes(32) | ||
| fd = os.open(str(path), os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600) | ||
| try: | ||
| os.write(fd, key) | ||
| finally: | ||
| os.close(fd) | ||
| return key |
There was a problem hiding this comment.
TOCTOU race: unhandled FileExistsError when another process wins the create.
If two processes call this concurrently, both pass the exists() check, but only one succeeds at O_EXCL. The loser raises FileExistsError instead of reading the key the winner wrote.
Since the systemd provisioner and coord-grant-mint share this helper, concurrent invocation (boot + manual call, parallel tests) is plausible.
Proposed fix
path = Path(path)
if path.exists():
return path.read_bytes()
path.parent.mkdir(parents=True, exist_ok=True)
key = secrets.token_bytes(32)
- fd = os.open(str(path), os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
+ try:
+ fd = os.open(str(path), os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
+ except FileExistsError:
+ # Another process won the race; read their key
+ return path.read_bytes()
try:
os.write(fd, key)
finally:
os.close(fd)
return key📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| path = Path(path) | |
| if path.exists(): | |
| return path.read_bytes() | |
| path.parent.mkdir(parents=True, exist_ok=True) | |
| key = secrets.token_bytes(32) | |
| fd = os.open(str(path), os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600) | |
| try: | |
| os.write(fd, key) | |
| finally: | |
| os.close(fd) | |
| return key | |
| path = Path(path) | |
| if path.exists(): | |
| return path.read_bytes() | |
| path.parent.mkdir(parents=True, exist_ok=True) | |
| key = secrets.token_bytes(32) | |
| try: | |
| fd = os.open(str(path), os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600) | |
| except FileExistsError: | |
| # Another process won the race; read their key | |
| return path.read_bytes() | |
| try: | |
| os.write(fd, key) | |
| finally: | |
| os.close(fd) | |
| return key |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@shared/governance/coord_capabilities.py` around lines 319 - 329, The code has
a TOCTOU race: after path.exists() another process may create the file so
os.open(..., O_EXCL) raises FileExistsError; change the block around the
os.open/os.write to catch FileExistsError from os.open (and any subsequent
failure to create) and in that case read and return the file contents
(path.read_bytes()) instead of propagating the exception; ensure you only close
the file descriptor if os.open succeeded (i.e., open in a try: fd = os.open(...)
except FileExistsError: return path.read_bytes() then proceed with writing and
finally os.close(fd)).
| def test_default_paths_are_a_user_writable_coord_ledger_outside_worktrees() -> None: | ||
| # Must NOT be the old root-owned /var/lib/hapax/coord that uid 1000 could never | ||
| # provision — that default left R2 unmaterialized and R3 inert (reform-improve | ||
| # coord SSOT provisioning). | ||
| for path in (DEFAULT_LEDGER_DB, DEFAULT_JSONL_MIRROR, DEFAULT_SPOOL_DIR): | ||
| assert not str(path).startswith("/var/lib/"), path | ||
|
|
||
| # One fixed `.../hapax/coord` tree; the three artifacts share that base. | ||
| coord = DEFAULT_LEDGER_DB.parent | ||
| assert coord.name == "coord" | ||
| assert coord.parent.name == "hapax" | ||
| assert coord / "ledger.db" == DEFAULT_LEDGER_DB | ||
| assert coord / "ledger.jsonl" == DEFAULT_JSONL_MIRROR | ||
| assert coord / "spool" == DEFAULT_SPOOL_DIR | ||
|
|
||
| # Outside every git worktree (NEW-4). | ||
| repo_root = Path.cwd().resolve() | ||
| for path in (DEFAULT_LEDGER_DB, DEFAULT_JSONL_MIRROR, DEFAULT_SPOOL_DIR): | ||
| assert path.is_absolute() | ||
| assert not path.resolve().is_relative_to(repo_root) | ||
| assert "evidence" not in path.parts |
There was a problem hiding this comment.
Make this test independent of the runner environment.
DEFAULT_* are snapshotted when shared.coord_event_log is imported at Lines 10-18, so a pre-set HAPAX_COORD_DIR or XDG_CACHE_HOME in CI can change what this test sees. Please clear/reload or assert against the dynamic resolver here so the test stays hermetic.
One way to isolate it
-def test_default_paths_are_a_user_writable_coord_ledger_outside_worktrees() -> None:
+def test_default_paths_are_a_user_writable_coord_ledger_outside_worktrees(
+ monkeypatch: pytest.MonkeyPatch,
+) -> None:
+ import importlib
+ import shared.coord_event_log as coord_event_log
+
+ monkeypatch.delenv("HAPAX_COORD_DIR", raising=False)
+ monkeypatch.delenv("XDG_CACHE_HOME", raising=False)
+ coord_event_log = importlib.reload(coord_event_log)
+
- for path in (DEFAULT_LEDGER_DB, DEFAULT_JSONL_MIRROR, DEFAULT_SPOOL_DIR):
+ for path in (
+ coord_event_log.DEFAULT_LEDGER_DB,
+ coord_event_log.DEFAULT_JSONL_MIRROR,
+ coord_event_log.DEFAULT_SPOOL_DIR,
+ ):
assert not str(path).startswith("/var/lib/"), path🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/shared/test_coord_event_log.py` around lines 44 - 64, The test relies
on constants DEFAULT_LEDGER_DB, DEFAULT_JSONL_MIRROR, and DEFAULT_SPOOL_DIR that
are evaluated at import time in shared.coord_event_log, so make the test
hermetic by setting/clearing relevant env vars (e.g. HAPAX_COORD_DIR and
XDG_CACHE_HOME) to a controlled temporary path, then reload the
shared.coord_event_log module (via importlib.reload) so DEFAULT_* are recomputed
from the sanitized environment before performing the assertions; alternatively
call the module's dynamic resolver function if available to obtain fresh paths
instead of using the already-imported constants.
What
Provision the coordination SSOT + grant tree so the R2 event log and the R3 daemon-independent escape grant are actually LIVE (reform-improve, CASE-SDLC-REFORM-001).
Root cause (confirmed live)
DEFAULT_COORD_DIR=/var/lib/hapax/coordis root-owned;hapax(uid 1000) cannotmkdirinto/var/lib/hapax. Consequences:coord_event_logSSOT never materialized there —path.parent.mkdirraisedPermissionError, swallowed.escape_grant_allows()always returned 1 (no grant dir/key), leaving only the deprecatedHAPAX_*_OFFoff-switch.coord-grant-mintcould not create the dir/key.Fix (no root — parent spec remediation #1)
Make
~/.cache/hapax/coordthe canonical default: user-writable, already where the live ledger sits, still one fixed location outside every worktree (NEW-4). Centralized inshared/coord_event_log.pyso the Python minter and the bash shim resolve identically (precedence: explicit grant override →$HAPAX_COORD_DIR→$XDG_CACHE_HOME→~/.cache).coord_base_dir/default_grant_dir/default_grant_key;default_event_log()now dynamic.provision_coord_tree()— daemon-independent provisioner: idempotently materializes{base, spool/, grants/}+ the 0600 grant-key and proves the base writable, raising LOUDLY on failure (no silent swallow — satisfies the fail-loud AC in-scope, without touching the out-of-scope daemon swallow).coord-boot-reconcile --provision— boot entry (provision then reconcile); exit 2 on an unwritable SSOT.systemd/units/hapax-coord-provision.service— oneshot,After=hapax-secrets,# Hapax-Auto-Enable: trueso the deploy (fix(sdlc): deploy auto-enables marked timers/services + activate FM-11 lane supervisor #3817) enables it on merge → provisioned at boot, no manual intervention.coord-grant-mint+escape-grant.shconsume the shared cache default;load_or_create_keycentralized incoord_capabilities.Live verification (real host)
~/.cache/hapax/coord(spool/, grants/, grant-key created; ledger replayed=1).cc-task-gategrant flipsescape_grant_allows1→0 (block→allow) with NO env override — R3 escape LIVE.ruffclean ·pyrefly0 errors · 348 coord/capability/hook/importer tests green.Acceptance criteria
Task:
reform-improve-coord-ssot-provisioning-20260601🤖 Generated with Claude Code
Summary by CodeRabbit
Release Notes
New Features
Chores