Atlas ratification robustness + admin control surface operator docs#98
Merged
Conversation
Wrap the JSON.parse calls in parseJsonObject/parseJsonArray/parseNumberArray so a single malformed provenance/evidence/generated_seed_ids blob throws a row-attributed error (column + row identity) instead of a bare SyntaxError that poisons the whole list query into an opaque 500. Validate toDate so an unparseable timestamp warns and returns null rather than silently yielding an Invalid Date that produces NaN sorts or a RangeError in toISOString.
Add an Admin control surface section covering the shared ANALYTICS_TOKEN bearer auth (401 on missing/invalid, 503 fail-closed when unset), the POST /admin/reindex op (full/source/repo scopes, 202/400) and the GET /admin/index-stats read op, each with an authenticated curl example.
Once an approval is durably persisted, a reindex-enqueue hiccup must not be reported as a failure. Previously a throw from queueSourceReindex routed through the generic 500 handler even though the approval had committed, so the reviewer retried and hit a confusing 409 (already approved). Split approval into two phases: the DB write keeps its 409/500 error mapping, and the queue enqueue runs in its own try/catch returning 200 + reindexQueued:false with a loud log on failure, mirroring the existing no-orchestrator branch.
The github/slack/discord webhook catch blocks logged a generic error and returned a generic 500 with no correlation id, unlike the analytics sendFile path. Emit a correlation_id in both the log and the response so a failed delivery is greppable, mirroring the existing analytics pattern.
…tion Document the 400 unknown-source/repo and 503 orchestrator/config-unavailable outcomes on adminReindexOp (previously only the 202 path was noted), and guard the Slack-notify queued extraction so a future op whose 202 body lacks queued can't emit undefined into the notification.
Add MCP_JWT_SECRET to the Docker Compose app environment block so a copy-pasted production compose doesn't crash-loop on the required-secret startup check, and reword the env-table note to say the ephemeral secret is generated for any non-production NODE_ENV (not just development). Correct the approve-durability test comment to describe what it actually asserts (absence from the pending list), and pin the synchronous-contract assumption behind the queueSourceReindex try/catch.
jpr5
added a commit
that referenced
this pull request
Jun 9, 2026
…s) + null-state-token window (#99) ## Summary Follow-up to #98 — fixes two correctness bugs in Atlas incremental indexing that a dry-run/review surfaced. - **Microsecond high-water precision (the silent-drop fix).** `getAtlasStateToken` previously round-tripped the high-water mark through a millisecond JS `Date`, so the `updated_at <= token` bound dropped the newest row whenever its timestamp had sub-millisecond digits — permanently, until something else bumped it. Now the token is the raw microsecond text (`to_char(... 'US')`) and acquisition windows bind as `$N::timestamptz` text (never a JS `Date`), so Postgres compares at full microsecond precision: the high-water row is included exactly once and never double-fetched or dropped. (Removed the interim `ceilStateTokenToMillis` rounding entirely.) - **Null-state-token window.** `incrementalAcquire` no longer builds a guaranteed-empty `> T AND <= T` window when the high-water read returns null — it warns loudly and skips, carrying the (validated) checkpoint forward. - **Fail-loud checkpoint guard.** `parseLowerBound` validates `lastStateToken` against the exact emitted µs format and throws a context-bearing error on garbage, *before* the null-token early return (so a corrupt checkpoint can't silently pass through on an empty source). ## Testing - New PGlite-independent bind-path test asserts the window binds as `::timestamptz` text (no `Date` truncation), plus a two-run integration test proving include-once / no-double-fetch / no-drop at adjacent microseconds — the race is reproduced and proven gone. - Fail-loud guard tests (unparseable, JS-parseable-but-wrong-format, corrupt-token-on-empty-source); spy tests carry a positive-control assertion so ESM interception can't pass vacuously. - Full suite green (4503 tests); typecheck/build/prettier clean. ## Review - Implementation + two unbiased 7-agent CR rounds → zero must-fix findings; the precision machinery and guards are confirmed correct. ## Deferred — recommended deeper follow-up (NOT introduced here; pre-existing) Reviewers repeatedly (and correctly) flag that the high-water token read and the content/removed reads are **not a single transactional snapshot**, and the token is a **combined** seed+cache high-water mark. Under concurrent writes or an exact same-microsecond cross-partition tie, a row can be skipped for a cycle. This is **self-healed by the nightly unconditional full reindex** and is not a regression from this change (microsecond precision makes ties strictly rarer). Closing it properly wants a transactional/repeatable-read snapshot around token+content reads, or per-source tokens — a worthwhile dedicated follow-up. Also pre-existing and out of scope here: repo-filtered orphan-cache-page removal, and `clearAtlasCachePageStale({content:""})` split-state.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Follow-ups to #97 (the admin control surface + unified admin-access auth). Hardens the Atlas ratification path and documents the admin control surface for self-hosted Pathfinder operators.
200 { reindexQueued: false }with a loud log, instead of a misleading500(which previously led to a confusing409on retry).rejectAtlasCandidatewas checked — it has no post-write enqueue, so no change.parseJsonObject/Array/Numbernow wrapJSON.parseand throw with column + row context (e.g.provenance of seed row id=42 key=...), so one malformed blob no longer poisons a whole list query with an opaqueSyntaxError.toDatevalidates — an unparseable timestamp now warns + returnsnullinstead of producing a silentInvalid Date(NaN sorts /RangeError)./analyticspattern so failed deliveries are greppable.ANALYTICS_TOKENauth (401/503),POST /admin/reindex(full/source/repo, 202/400) andGET/POST /admin/index-statswith curl examples. Also fixes the Docker Compose example to include the requiredMCP_JWT_SECRET(previously omitted → crash-on-boot for copy-paste deploys) and corrects the ephemeral-JWT-secret note to apply to any non-production env.Testing
200 reindexQueued:false(durable), malformed-JSON → context-bearing error, invalid timestamp → null.Review
Deferred to follow-ups (not introduced here)
getAtlasStateTokenround-trips through a millisecond-precisionDatewhileupdated_atis microsecond-precision in Postgres, so thechangedOnOrBeforebound can silently drop the newest row from incremental indexing; plusincrementalAcquirecollapses to an empty window when the state token is null. Worth a dedicated correctness pass.parseNumberArray/non-object provenance; added coverage for the400 atlas_candidate_key_requiredand 500-fallback branches.