Fix atlas state-token precision stranding approved seeds #104

Merged
jpr5 merged 4 commits into main from fix/atlas-state-token-precision
Jun 11, 2026
@jpr5 jpr5 commented Jun 11, 2026

Contributor

Summary

getAtlasStateToken returns MAX(updated_at) from the atlas tables, but the value round-trips through a JS Date, truncating PostgreSQL's microsecond precision to milliseconds (e.g. …28.78683 → …28.786Z). The incremental acquire path then bounds with updated_at > $lastToken AND updated_at <= $newToken at full microsecond precision, so the row that produced the token falls in the sub-millisecond gap. It fails updated_at <= token in the run that generated the token, and every later run bounds > token AND <= token, so the row is permanently stranded. Symptom: approved seeds never get indexed by incremental reindex, and the failure is silent ("Indexing complete" with 0 items).
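
The gap can be reproduced without a database. The following is a toy model, not the repository's code: timestamps are plain integer microseconds, and `Row`, `stateToken`, and `acquireUnfixed` are hypothetical names introduced only for illustration. It uses the timestamp from this PR and the unfixed bounds as described above.

```typescript
// Toy model: timestamps are integer microseconds since the Unix epoch,
// standing in for PostgreSQL's microsecond-precision updated_at.
interface Row {
  id: string;
  updatedAtMicros: number;
}

// 2026-06-11T01:46:28.786830Z expressed in microseconds.
const seedMicros = Date.UTC(2026, 5, 11, 1, 46, 28, 786) * 1000 + 830;
const rows: Row[] = [{ id: "micro", updatedAtMicros: seedMicros }];

// Mimics MAX(updated_at) passing through a JS Date: the sub-millisecond
// digits are dropped. Assumes a non-empty row list.
function stateToken(all: Row[]): number {
  const max = Math.max(...all.map((r) => r.updatedAtMicros));
  return max - (max % 1000);
}

// Unfixed bounds: updated_at > lastToken AND updated_at <= newToken,
// evaluated at full microsecond precision.
function acquireUnfixed(all: Row[], lastToken: number, newToken: number): string[] {
  return all
    .filter((r) => r.updatedAtMicros > lastToken && r.updatedAtMicros <= newToken)
    .map((r) => r.id);
}

const token = stateToken(rows);
const tokenIso = new Date(token / 1000).toISOString();
const previousToken = token - 60 * 1000 * 1000; // an arbitrary earlier token

// Run 1: the row that produced the token sits 830 µs past the upper bound.
const firstRun = acquireUnfixed(rows, previousToken, token);
// Run 2: the persisted token is now both bounds, so (token, token] is empty.
const secondRun = acquireUnfixed(rows, token, token);

console.log(tokenIso, firstRun, secondRun);
```

Both runs return an empty list even though the row exists and is newer than the previous token, which matches the "Indexing complete" with 0 items symptom.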

Fix: compare with date_trunc('milliseconds', updated_at) on BOTH bound clauses in addUpdatedAtClauses, so comparison precision matches token precision. This is the minimal correct surface: truncating inside MAX() in getAtlasStateToken would be redundant (the JS Date round-trip already truncates the token) and would not by itself close the comparison gap — the mismatch only matters where the bounds are evaluated.
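
A minimal in-memory sketch of why the truncation has to apply to both bounds. This is not the body of addUpdatedAtClauses: `truncToMillis` stands in for date_trunc('milliseconds', updated_at), and `Bounds`, `changedAfter`, and the `truncateLower` switch are assumptions made for the example (changedOnOrBefore is named in this PR, but its real type is not shown here).

```typescript
interface Row {
  id: string;
  updatedAtMicros: number; // integer microseconds since the Unix epoch
}

// In-memory stand-in for date_trunc('milliseconds', updated_at).
function truncToMillis(micros: number): number {
  return micros - (micros % 1000);
}

interface Bounds {
  changedAfter: number | null; // hypothetical name: last persisted token, exclusive
  changedOnOrBefore: number;   // new token, inclusive
}

// truncateLower = false models a partial fix that truncates only the upper bound.
function acquire(all: Row[], b: Bounds, truncateLower: boolean = true): string[] {
  return all
    .filter((r) => {
      const truncated = truncToMillis(r.updatedAtMicros);
      const lowerValue = truncateLower ? truncated : r.updatedAtMicros;
      const aboveLower = b.changedAfter === null || lowerValue > b.changedAfter;
      return aboveLower && truncated <= b.changedOnOrBefore;
    })
    .map((r) => r.id);
}

const seedMicros = Date.UTC(2026, 5, 11, 1, 46, 28, 786) * 1000 + 830;
const rows: Row[] = [{ id: "micro", updatedAtMicros: seedMicros }];
const token = truncToMillis(seedMicros);
const earlier = token - 60 * 1000 * 1000; // an arbitrary earlier token

// First cycle: the truncated value equals the token, so the row is included.
const firstCycle = acquire(rows, { changedAfter: earlier, changedOnOrBefore: token });
// Second cycle: (token, token] is empty, so the row is not re-emitted.
const secondCycle = acquire(rows, { changedAfter: token, changedOnOrBefore: token });
// Upper bound truncated only: the raw value still exceeds the token,
// so the boundary row would come back on every cycle.
const upperOnly = acquire(rows, { changedAfter: token, changedOnOrBefore: token }, false);

console.log(firstCycle, secondCycle, upperOnly);
```

In this model the row is emitted exactly once only when both comparisons use the truncated value.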

Found by the atlas sandbox live run; hot-patch validated there end-to-end before this PR (seed approved at 01:46:28.78683, token 2026-06-11T01:46:28.786Z, 0 chunks indexed before the patch; seed indexed with 1 chunk and a real search hit after).

Test plan

  • New red-green regression test in src/__tests__/atlas-db.test.ts: approved seed with updated_at = '2026-06-11T01:46:28.786830+00', token captured, incremental bounds applied. Red against unfixed code: AssertionError: expected [] to deeply equal [ 'micro' ]. Green after the fix.
  • Second-cycle assertion pins the LOWER bound: querying (token, token] must return []. Red-verified against an un-truncated updated_at > token (boundary row re-emits forever) before restoring the fix.
  • Recovery-path assertion: a fullAcquire-shaped query (changedOnOrBefore only) includes the previously-stranded row — pins the deploy-step recovery mechanism in code.
  • resolveAtlasStateToken unit tests: throws on an unparseable non-null MAX(updated_at) (fail loud instead of silently shrinking the window), null on empty tables, max-of-maxes as ms ISO string.
  • Full suite: 324 files, 5933/5933 passing
  • tsc --noEmit on both tsconfigs (tsconfig.json, tsconfig.scripts.json)
  • npm run build clean
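
The three resolveAtlasStateToken behaviors listed above can be sketched as follows. The real function's signature is not shown in this PR, so the input shape (one MAX(updated_at) value per atlas table, as a Date, a string, or null) and the name `resolveAtlasStateTokenSketch` are assumptions.

```typescript
// Assumed input: one MAX(updated_at) result per atlas table.
type MaxUpdatedAt = Date | string | null;

function resolveAtlasStateTokenSketch(maxes: MaxUpdatedAt[]): string | null {
  let best: number | null = null;
  for (const value of maxes) {
    if (value === null) continue; // empty table: contributes nothing
    const ms = value instanceof Date ? value.getTime() : Date.parse(value);
    if (isNaN(ms)) {
      // Fail loud instead of silently shrinking the window.
      throw new Error("Unparseable MAX(updated_at): " + String(value));
    }
    if (best === null || ms > best) best = ms;
  }
  // Max of the per-table maxima, as a millisecond-precision ISO string.
  return best === null ? null : new Date(best).toISOString();
}

const emptyTables = resolveAtlasStateTokenSketch([null, null]);
const maxOfMaxes = resolveAtlasStateTokenSketch([
  "2026-06-11T01:46:28.786Z",
  new Date("2026-06-10T00:00:00.000Z"),
  null,
]);

console.log(emptyTables, maxOfMaxes);
```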

Deploy step (one-time, REQUIRED)

The bound fix is forward-only: a prod row stranded by the pre-fix bug has date_trunc('milliseconds', updated_at) exactly equal to the persisted token, and the fixed lower bound is strict (> token), so the already-stranded row stays excluded forever with all signals green — deploying the code alone does not heal it. Clearing the persisted state token forces one full re-acquire (idempotent re-index), which bounds only with changedOnOrBefore and picks the stranded rows back up.
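
A toy model of the recovery sequence, assuming the orchestrator reads the persisted token, queries with the fixed bounds, and then writes the new token. `IndexState` and `runOnce` are hypothetical names; `lastToken` stands in for the value this PR stores in index_state.last_commit_sha.

```typescript
interface Row {
  id: string;
  updatedAtMicros: number; // integer microseconds since the Unix epoch
}

// Models the persisted state token for the atlas source.
interface IndexState {
  lastToken: number | null;
}

const truncToMillis = (micros: number): number => micros - (micros % 1000);

// One orchestrator run under the fixed bounds; returns the ids it would index.
// A null token means a full re-acquire bounded only from above.
function runOnce(state: IndexState, all: Row[]): string[] {
  const newToken = truncToMillis(Math.max(...all.map((r) => r.updatedAtMicros)));
  const picked = all
    .filter((r) => {
      const t = truncToMillis(r.updatedAtMicros);
      return (state.lastToken === null || t > state.lastToken) && t <= newToken;
    })
    .map((r) => r.id);
  state.lastToken = newToken;
  return picked;
}

const seedMicros = Date.UTC(2026, 5, 11, 1, 46, 28, 786) * 1000 + 830;
const rows: Row[] = [{ id: "micro", updatedAtMicros: seedMicros }];

// State as the pre-fix bug left it: token persisted, row never indexed.
const state: IndexState = { lastToken: truncToMillis(seedMicros) };

const afterDeployOnly = runOnce(state, rows); // strict lower bound still excludes the row
state.lastToken = null;                       // the one-time clear
const afterClear = runOnce(state, rows);      // full re-acquire picks the row up
const steadyState = runOnce(state, rows);     // later incremental runs do not re-emit it

console.log(afterDeployOnly, afterClear, steadyState);
```

Under these assumptions, deploying the code alone indexes nothing, the run after the clear indexes the stranded row once, and later runs stay quiet.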

Note the wall-clock time, then run (idle-gated: a run in flight reads state before and writes the token after this UPDATE, which would silently cancel the clear):

UPDATE index_state
SET last_commit_sha = NULL
WHERE source_type = 'atlas'
  AND status <> 'indexing'
RETURNING source_key, status;

Every atlas row must come back in RETURNING. If any row was skipped (a run was in flight), wait for it to finish and re-run the UPDATE until all atlas rows are returned.

Confirm the clear took effect before the next orchestrator run:

SELECT source_key, last_commit_sha, status
FROM index_state WHERE source_type = 'atlas';
-- expect: last_commit_sha IS NULL on every row

After the next orchestrator run, verify recovery actually ran:

SELECT source_key, last_commit_sha, status, last_indexed_at
FROM index_state WHERE source_type = 'atlas';
-- expect: repopulated last_commit_sha, status = 'idle', AND
-- last_indexed_at LATER than the time you ran the UPDATE above
-- (proves the full re-acquire ran after the clear, not pre-deploy
-- state surviving a clobber).

This step is part of this PR's definition of done — whoever merges deploys and runs it in the same motion.

@jpr5 jpr5 merged commit 8b94c8d into main Jun 11, 2026
6 checks passed
@jpr5 jpr5 deleted the fix/atlas-state-token-precision branch June 11, 2026 20:59