fix(open-knowledge): MCP writes reflect on-disk truth — reconcile…#29
Merged
Conversation
…-of-band edits (PRD-6832) (#1591) * fix(open-knowledge): MCP writes reflect on-disk truth — reconcile out-of-band edits (PRD-6832) PRD-6832 β: a stale loaded CRDT could silently clobber a newer out-of-band disk edit on the next MCP write, making disk itself stale (the "silent stale read"). This makes disk authoritative on the write/reconcile path. - L1 reconcile-before-apply (write/edit/frontmatter + the rename spine): ingest a divergent disk edit via applyExternalChange before the agent edit lands; both survive. A doc that's mid-conflict is never reconciled (the refusal owns it). - L3 store-time backstop: on the residual TOCTOU, abort the overwrite (disk wins) and return urn:ok:error:disk-divergence (409); retry re-applies once. Gated to agent-write stores via a pre-executeNow marker — Hocuspocus passes a null transaction origin for agent DirectConnection writes, so origin can't gate. - FR3: success-path disk-edit-reconciled warning (structuredContent.contentDivergence + text nudge) so the agent re-reads the combined result. Rename serializes the loaded CRDT and bypasses storeDocumentNow, so it needed its own spine reconcile (L3 can't guard it). FR5 (replace recovery checkpoint) deferred; ContentDivergenceWarningSchema is a stop-gap for the unmerged #1270. Spec: specs/2026-05-30-ok-mcp-read-reflects-disk/. * test(open-knowledge): give rollback-fault test CI-git headroom (pollUntil 25s + 60s timeout) The version-rollback fault test waits on two shadow-repo checkpoint commits. Shadow-repo git commits are slow on loaded CI runners — the pollUntil(12000) budget and the inherited bun 5s per-test default are both too tight there (the checkpoints land fine in isolation). Size the pollUntil budgets to 25s each and the per-test timeout to 60s; pollUntil returns as soon as a checkpoint lands, so the fast path is unchanged — this only buys headroom for slow-git CI runs. * fix(open-knowledge): wire L1 reconcile into legacy agent-write; note no-op/L3 residual (PR review) Addresses pullfrog review on #1591: - agent-write (legacy /api/agent-write content handler) had the L3 backstop + the AgentWriteSuccessSchema warning field but no L1 reconcile, so the common case (disk diverged before the call) clobbered. Wire reconcileDiskBeforeAgentWrite in for symmetry with the other content handlers; the warning is now producible. - Document the no-op-skip-precedes-L3 residual for the L1-less handlers (undo/rollback) in the spec as-built notes: a no-op store skip never overwrites, so there's no data loss — only an absent signal for a write that applied no bytes. * Revert rollback-fault test budget bump — β should not touch this test The version-rollback fault test fails under merge-queue I/O contention on a hardcoded internal pollUntil(…, 12000) waiting for a shadow-repo checkpoint commit. This is a known, general merge-queue flake (it has ejected #1270, #1411, #1590 — PRs unrelated to this diff) and is owned by the deflake work in #1588, which also added a suite-wide 30s --timeout that makes a per-test bump redundant. A bigger budget doesn't fix it (the internal poll is the killer), and a proper fix overlaps #1588 + may surface a real shadow-repo-commit-under-load perf signal that shouldn't be masked inside a disk-authority PR. Restore the file to main. * fix(open-knowledge): guard L3 disk-wins ingest against false-success 200 (PRD-6832) Address the Major finding from code review: the L3 store-time backstop's disk-wins applyDiskContent transact sat outside any failure-recording try/catch (the atomic-write catch is below it; the withSpan callback opens no outer try), so an ingest-throw would propagate to a false-success 200 — agent bytes on neither disk nor re-ingested CRDT. Mirror the two already- guarded identical applyDiskContent calls (tripwire-reset + atomic-write): on throw, record a store failure + rethrow so the awaiting handler surfaces a storage error via the already-tested takeStoreFailure path. The base stays at currentBase so a retry re-reconciles. Also fold in two review cosmetics in agent-write.ts (both insertion artifacts of this PR): add the missing `satisfies StandardSchemaV1` to ContentDivergenceWarningSchema, and restore the AgentWriteSuccessSchema JSDoc displaced by that insertion. Document the intentional resolveEmbed omission at the rename reconcile site (content-bytes only; ref round-trips). * test+harden(open-knowledge): close /review-local gaps on the L3 disk-authority surface (PRD-6832 β) Address the 6 findings from the local adversarial review (no impl bug found; all test-coverage + consistency/observability): - Major: add an undo-handler L3 test — undo's ONLY disk-authority guard (no L1). Asserts 409 + urn:ok:error:disk-divergence + native survives + undo not applied; recovery via a forward agent-write-md (undo's post-revert UM-stack semantics are intentionally out of scope). - Major: add a gate-exclusion test — an unmarked human/client store is never reverted by L3 (presence assertion sidesteps the seam write/rename race). Fix the phantom citation to a nonexistent store-divergence-gate.test.ts. - Minor: DiskEditReconciledWarning byte fields use Buffer.byteLength (UTF-8) to match the sibling ContentDivergenceWarning on the shared WriteWarningSchema union (was string.length / UTF-16). - Minor: surface the L3 disk-read fail-open with a structured log.warn (terminal guard; log-only rather than shoehorn the external-change-namespaced counter). - Consider: version.ts MCP tool parses WriteWarningSchema for symmetry with the 3 write/edit tools (rollback emits only content-divergence today; future-proof). - Consider: document the setReconciledBase synchronous base-advance as the file-watcher double-ingest prevention invariant. Verified: typecheck clean; new tests 3x deterministic. --------- GitOrigin-RevId: 4b588cb1d0135626c7397c0018e9034f263e08bf
Contributor
There was a problem hiding this comment.
Automated approval from agents-private public-mirror-sync (run: https://github.com/inkeep/agents-private/actions/runs/26927629918). Source of truth is the monorepo; direct edits on inkeep/open-knowledge are overwritten on next sync.
|
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
fix(open-knowledge): MCP writes reflect on-disk truth — reconcile out-of-band edits instead of silently clobbering them (PRD-6832)
An OK MCP read could authoritatively return content older than the bytes on disk, with no warning. The root cause was upstream of the read: a doc loaded in the server holds stale in-memory CRDT state, an out-of-band edit (a script,
git pull, manual edit) makes disk newer, and the next MCP write serializes the stale CRDT over the newer file — making disk itself stale, so the next read faithfully returns bad bytes. In one incident this gave an agent a stale spec scope and cost a multi-file revert.This change makes disk authoritative on the write/reconcile path:
write_document/edit_document/edit_frontmatter) or arenameapplies, the server compares the on-disk bytes to the last-synced base and, on divergence, ingests the disk edit first (through the existing sanctioned file-watcher path) so the agent's edit lands on top of current reality. Both edits survive.urn:ok:error:disk-divergence(409) — the agent's edit was NOT applied; re-read and retry (a retry re-applies exactly once). This is the only guard forundo/rollback, which have no L1.disk-edit-reconciledwarning (structuredContent.contentDivergence) so the agent re-reads to see the combined result. Observational — the write still landed and both edits are on disk.Human-editor (browser) writes are unaffected: the backstop only fires on agent-triggered stores, so in-progress typing is never reverted.