Skip to content

fix: atomically swap cached native binary#591

Merged
ousamabenyounes merged 4 commits into
mksglu:nextfrom
mturac:fix/issue-587-atomic-native-cache-swap
May 17, 2026
Merged

fix: atomically swap cached native binary#591
ousamabenyounes merged 4 commits into
mksglu:nextfrom
mturac:fix/issue-587-atomic-native-cache-swap

Conversation

@mturac
Copy link
Copy Markdown
Contributor

@mturac mturac commented May 15, 2026

Summary

  • copy cached better-sqlite3 ABI binaries to a staging file before replacing the active binary
  • codesign the staging file before the atomic rename on macOS
  • keep the existing ABI cache behavior and update the test harness for the new helper

Fixes #587

Tests

  • git diff --check
  • pnpm exec vitest run tests/core/cli.test.ts -t "ABI-aware native binary caching|cache hit"
  • pnpm exec tsc --noEmit

@mksglu mksglu changed the base branch from main to next May 16, 2026 16:34
@ousamabenyounes
Copy link
Copy Markdown
Collaborator

Hey @mturac — I checked out this branch and ran the empirical side of the contract you're adding. Sharing the result because it's directly relevant to one missing piece.

Verified

  • pnpm exec vitest run tests/core/cli.test.ts -t "ABI-aware native binary caching|cache hit"12 passed
  • pnpm exec tsc --noEmit → clean

Then I built a small filesystem-level repro to check the race itself. macOS codesign + Mach-O loader aren't reachable on Linux, but the underlying race in #587 is a multi-syscall copyFileSync window — that part is platform-independent and is exactly what rename(2) closes. I stubbed codesignBinary with a 40 ms busy-wait, ran 30 swap iterations alternating between two stable cache contents (HASH_A / HASH_B), and forked 8 concurrent readers hashing binaryPath in a tight loop. Any observed hash other than HASH_A or HASH_B = a caught-mid-write read.

Implementation Reads Partial states observed
copyFileSync + codesignBinary (pre-PR, in-place) 7288 332 distinct (0 B, 4 KB, 3.3 MB, 7.4 MB, …)
copyFileSynccodesignBinaryrenameSync (this PR) 1722 0

So the atomicity guarantee is real — readers only see one of the two stable inodes, never an in-flight one. On macOS the same fs window is what leaves the codesign signature invalid during dlopen, which is what trips F_ADDFILESIGS_RETURN errno=37 → SIGKILL/SIGBUS in #587.

What's missing

The PR doesn't ship a test that fails when the patch is reverted. Both cache hit: … tests assert only the final content, so they pass identically against the old in-place implementation — the atomicity + rollback contract is uncovered.

A cheap deterministic addition would be a codesign-failure rollback test, which exercises the two things this PR uniquely guarantees:

test("cache hit: codesign failure rolls back and cleans up staging file", async () => {
  // loadEnsureNativeCompat with a codesignBinary stub that throws
  const ensureNativeCompat = await loadEnsureNativeCompat({ codesignThrows: true });
  createFakeBinary(abiCachePath(), "new-cached-content");
  createFakeBinary(binaryPath, "old-active-content");

  expect(() => ensureNativeCompat(tempDir)).toThrow();

  // (1) active binary unchanged — never observable in inconsistent state
  expect(readFileSync(binaryPath, "utf-8")).toBe("old-active-content");
  // (2) staging tempfile cleaned up on the error path
  expect(readdirSync(releaseDir).filter(f => f.includes(".staging-"))).toHaveLength(0);
});

The existing loadEnsureNativeCompat regex extractor already splices helpers — passing an option that swaps the extracted codesignBinary body for a throwing stub is a 5-line change. Happy to push this as a follow-up commit if useful.

Otherwise the conservative scope here is exactly right (atomic rename only, no .signed sentinel — matches what the reporter recommended in their "Notes / uncertainty" section).

(Minor: the stats.json bump looks unrelated and would be cleaner in a separate PR.)

@mksglu
Copy link
Copy Markdown
Owner

mksglu commented May 17, 2026

@mturac resolve conflicts and comments please

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

PostToolUse hook intermittently exits non-zero on macOS during cold start

3 participants