Fix atlas state-token precision stranding approved seeds #104
Merged
Summary
`getAtlasStateToken` returns `MAX(updated_at)` from the atlas tables, but the value round-trips through a JS `Date`, which truncates PostgreSQL's microsecond precision to milliseconds (e.g. `…28.78683` → `…28.786Z`). The incremental acquire path then bounds with `updated_at > $lastToken AND updated_at <= $newToken` at full microsecond precision, so the row that produced the token falls in the sub-millisecond gap. It fails `updated_at <= token` in the run that generated the token, and every later run bounds with `> token AND <= token`, so the row is permanently stranded.
Symptom: approved seeds never get indexed by incremental reindex, and the failure is silent ("Indexing complete" with 0 items).
Fix: compare with `date_trunc('milliseconds', updated_at)` on BOTH bound clauses in `addUpdatedAtClauses`, so comparison precision matches token precision. This is the minimal correct surface: truncating inside `MAX()` in `getAtlasStateToken` would be redundant (the JS `Date` round-trip already truncates the token) and would not by itself close the comparison gap, because the mismatch only matters where the bounds are evaluated.
Found by the atlas sandbox live run; the hot-patch was validated there end-to-end before this PR (seed approved at `01:46:28.78683`, token `2026-06-11T01:46:28.786Z`, 0 chunks indexed before the patch; seed indexed with 1 chunk and a real search hit after).
Test plan
- `src/__tests__/atlas-db.test.ts`: approved seed with `updated_at = '2026-06-11T01:46:28.786830+00'`, token captured, incremental bounds applied. Red against unfixed code: `AssertionError: expected [] to deeply equal [ 'micro' ]`. Green after the fix.
- A run bounded by `(token, token]` must return `[]`. Red-verified against an un-truncated `updated_at > token` (boundary row re-emits forever) before restoring the fix.
- A full re-acquire (`changedOnOrBefore` only) includes the previously-stranded row, which pins the deploy-step recovery mechanism in code.
- `resolveAtlasStateToken` unit tests: throws on an unparseable non-null `MAX(updated_at)` (fail loud instead of silently shrinking the window), `null` on empty tables, max-of-maxes as ms ISO string.
- `tsc --noEmit` on both tsconfigs (`tsconfig.json`, `tsconfig.scripts.json`)
- `npm run build` clean
Deploy step (one-time, REQUIRED)
The bound fix is forward-only: a prod row stranded by the pre-fix bug has `date_trunc('milliseconds', updated_at)` exactly equal to the persisted token, and the fixed lower bound is strict (`> token`), so the already-stranded row stays excluded forever with all signals green. Deploying the code alone does not heal it. Clearing the persisted state token forces one full re-acquire (idempotent re-index), which bounds only with `changedOnOrBefore` and picks the stranded rows back up.
Note the wall-clock time, then run (idle-gated: a run in flight reads state before and writes the token after this UPDATE, which would silently cancel the clear):
Every atlas row must come back in RETURNING. If any row was skipped (a run was in flight), wait for it to finish and re-run the UPDATE until all atlas rows are returned.
Confirm the clear took effect before the next orchestrator run:
After the next orchestrator run, verify recovery actually ran:
This step is part of this PR's definition of done — whoever merges deploys and runs it in the same motion.
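For reviewers who want to see the precision gap from the Summary concretely, here is a minimal in-memory TypeScript sketch. It is a model, not the project's code: `Row`, `truncToMillis`, `stateToken`, `tokenToMicros`, `acquire`, and the `earlier` and `later` tokens are hypothetical names and values introduced for illustration. Timestamps are represented as integer microseconds, `Math.floor` stands in for the JS `Date` round-trip, and a boolean flag stands in for wrapping `updated_at` in `date_trunc('milliseconds', …)`.

```typescript
// Model only: timestamps are integer microseconds since the Unix epoch,
// standing in for PostgreSQL's microsecond-precision timestamps.
interface Row {
  id: string;
  updatedAtMicros: number;
}

// Drop the sub-millisecond part, like date_trunc('milliseconds', updated_at).
const truncToMillis = (micros: number): number => Math.floor(micros / 1000) * 1000;

// Hypothetical stand-in for the token path: the maximum updated_at passed
// through a JS Date, which only stores whole milliseconds.
function stateToken(rows: Row[]): string | null {
  if (rows.length === 0) return null;
  const maxMicros = Math.max(...rows.map((r) => r.updatedAtMicros));
  return new Date(Math.floor(maxMicros / 1000)).toISOString();
}

const tokenToMicros = (token: string): number => Date.parse(token) * 1000;

// Hypothetical stand-in for the bound clauses. lastToken === null models a
// full re-acquire (upper bound only). truncate === true models the fix.
function acquire(
  rows: Row[],
  lastToken: string | null,
  newToken: string,
  truncate: boolean
): string[] {
  const lower = lastToken === null ? null : tokenToMicros(lastToken);
  const upper = tokenToMicros(newToken);
  return rows
    .filter((r) => {
      const t = truncate ? truncToMillis(r.updatedAtMicros) : r.updatedAtMicros;
      return (lower === null || t > lower) && t <= upper;
    })
    .map((r) => r.id);
}

// The seed from the Summary: 2026-06-11T01:46:28.786830Z (month index 5 = June).
const seed: Row = {
  id: "micro",
  updatedAtMicros: Date.UTC(2026, 5, 11, 1, 46, 28, 786) * 1000 + 830,
};
const rows: Row[] = [seed];

const earlier = "2026-06-11T00:00:00.000Z"; // assumed previous token
const later = "2026-06-11T02:00:00.000Z"; // assumed token of a later run
const token = stateToken(rows);
if (token === null) {
  throw new Error("expected a token for a non-empty table");
}

// Pre-fix: the row misses the upper bound by 830 microseconds, then sits in
// the empty (token, token] window on every later run.
const preFixFirstRun = acquire(rows, earlier, token, false);
const preFixLaterRun = acquire(rows, token, token, false);

// Post-fix: the row is picked up once and is not re-emitted at the boundary.
const fixedFirstRun = acquire(rows, earlier, token, true);
const fixedBoundary = acquire(rows, token, token, true);

// Forward-only: with the old token still persisted, the fixed strict lower
// bound keeps excluding the row; only a full re-acquire recovers it.
const strandedAfterDeploy = acquire(rows, token, later, true);
const fullReacquire = acquire(rows, null, later, true);

console.log(token); // 2026-06-11T01:46:28.786Z
console.log(preFixFirstRun, preFixLaterRun); // [] []
console.log(fixedFirstRun, fixedBoundary); // [ 'micro' ] []
console.log(strandedAfterDeploy, fullReacquire); // [] [ 'micro' ]
```

The last two results are the reason the deploy step exists in this model: after the fix, a row whose truncated `updated_at` equals the persisted token is still excluded by the strict lower bound, and it only comes back when the lower bound is dropped.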