Skip to content

Atlas ratification robustness + admin control surface operator docs#98

Merged
jpr5 merged 6 commits into
mainfrom
follow-up/atlas-hardening-and-admin-docs
Jun 8, 2026
Merged

Atlas ratification robustness + admin control surface operator docs#98
jpr5 merged 6 commits into
mainfrom
follow-up/atlas-hardening-and-admin-docs

Conversation

@jpr5

@jpr5 jpr5 commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Summary

Follow-ups to #97 (the admin control surface + unified admin-access auth). Hardens the Atlas ratification path and documents the admin control surface for self-hosted Pathfinder operators.

  • Approve no longer reports a committed approval as a failure — the approve-time reindex enqueue moved outside the DB-write try/catch. A queue hiccup after a durable approval now returns 200 { reindexQueued: false } with a loud log, instead of a misleading 500 (which previously led to a confusing 409 on retry). rejectAtlasCandidate was checked — it has no post-write enqueue, so no change.
  • Row-attributed JSON parse errorsparseJsonObject/Array/Number now wrap JSON.parse and throw with column + row context (e.g. provenance of seed row id=42 key=...), so one malformed blob no longer poisons a whole list query with an opaque SyntaxError.
  • toDate validates — an unparseable timestamp now warns + returns null instead of producing a silent Invalid Date (NaN sorts / RangeError).
  • Webhook 500s carry a correlation id — mirrors the /analytics pattern so failed deliveries are greppable.
  • Operator docs — new "Admin control surface" section in the deploy guide: shared ANALYTICS_TOKEN auth (401/503), POST /admin/reindex (full/source/repo, 202/400) and GET/POST /admin/index-stats with curl examples. Also fixes the Docker Compose example to include the required MCP_JWT_SECRET (previously omitted → crash-on-boot for copy-paste deploys) and corrects the ephemeral-JWT-secret note to apply to any non-production env.

Testing

  • Full suite green (4386 tests); typecheck + build clean.
  • New red-green coverage: approve-enqueue-throws → 200 reindexQueued:false (durable), malformed-JSON → context-bearing error, invalid timestamp → null.

Review

  • Implementation round + two 7-agent unbiased CR rounds → zero must-fix findings; no access broaden/narrow regression.

Deferred to follow-ups (not introduced here)

  • Atlas acquisition precision (recommended next): getAtlasStateToken round-trips through a millisecond-precision Date while updated_at is microsecond-precision in Postgres, so the changedOnOrBefore bound can silently drop the newest row from incremental indexing; plus incrementalAcquire collapses to an empty window when the state token is null. Worth a dedicated correctness pass.
  • Minor: tighten the auth-doc 503-vs-401 wording for the non-production analytics-enabled case; loud-handling for parseNumberArray/non-object provenance; added coverage for the 400 atlas_candidate_key_required and 500-fallback branches.

jpr5 added 6 commits June 8, 2026 10:57
Wrap the JSON.parse calls in parseJsonObject/parseJsonArray/parseNumberArray
so a single malformed provenance/evidence/generated_seed_ids blob throws a
row-attributed error (column + row identity) instead of a bare SyntaxError
that poisons the whole list query into an opaque 500. Validate toDate so an
unparseable timestamp warns and returns null rather than silently yielding an
Invalid Date that produces NaN sorts or a RangeError in toISOString.
Add an Admin control surface section covering the shared ANALYTICS_TOKEN
bearer auth (401 on missing/invalid, 503 fail-closed when unset), the
POST /admin/reindex op (full/source/repo scopes, 202/400) and the
GET /admin/index-stats read op, each with an authenticated curl example.
Once an approval is durably persisted, a reindex-enqueue hiccup must not be
reported as a failure. Previously a throw from queueSourceReindex routed
through the generic 500 handler even though the approval had committed, so the
reviewer retried and hit a confusing 409 (already approved). Split approval
into two phases: the DB write keeps its 409/500 error mapping, and the queue
enqueue runs in its own try/catch returning 200 + reindexQueued:false with a
loud log on failure, mirroring the existing no-orchestrator branch.
The github/slack/discord webhook catch blocks logged a generic error and
returned a generic 500 with no correlation id, unlike the analytics sendFile
path. Emit a correlation_id in both the log and the response so a failed
delivery is greppable, mirroring the existing analytics pattern.
…tion

Document the 400 unknown-source/repo and 503 orchestrator/config-unavailable
outcomes on adminReindexOp (previously only the 202 path was noted), and guard
the Slack-notify queued extraction so a future op whose 202 body lacks queued
can't emit undefined into the notification.
Add MCP_JWT_SECRET to the Docker Compose app environment block so a
copy-pasted production compose doesn't crash-loop on the required-secret
startup check, and reword the env-table note to say the ephemeral secret
is generated for any non-production NODE_ENV (not just development).

Correct the approve-durability test comment to describe what it actually
asserts (absence from the pending list), and pin the synchronous-contract
assumption behind the queueSourceReindex try/catch.
@jpr5 jpr5 merged commit 0cc8881 into main Jun 8, 2026
7 checks passed
@jpr5 jpr5 deleted the follow-up/atlas-hardening-and-admin-docs branch June 8, 2026 18:48
jpr5 added a commit that referenced this pull request Jun 9, 2026
…s) + null-state-token window (#99)

## Summary
Follow-up to #98 — fixes two correctness bugs in Atlas incremental
indexing that a dry-run/review surfaced.

- **Microsecond high-water precision (the silent-drop fix).**
`getAtlasStateToken` previously round-tripped the high-water mark
through a millisecond JS `Date`, so the `updated_at <= token` bound
dropped the newest row whenever its timestamp had sub-millisecond digits
— permanently, until something else bumped it. Now the token is the raw
microsecond text (`to_char(... 'US')`) and acquisition windows bind as
`$N::timestamptz` text (never a JS `Date`), so Postgres compares at full
microsecond precision: the high-water row is included exactly once and
never double-fetched or dropped. (Removed the interim
`ceilStateTokenToMillis` rounding entirely.)
- **Null-state-token window.** `incrementalAcquire` no longer builds a
guaranteed-empty `> T AND <= T` window when the high-water read returns
null — it warns loudly and skips, carrying the (validated) checkpoint
forward.
- **Fail-loud checkpoint guard.** `parseLowerBound` validates
`lastStateToken` against the exact emitted µs format and throws a
context-bearing error on garbage, *before* the null-token early return
(so a corrupt checkpoint can't silently pass through on an empty
source).

## Testing
- New PGlite-independent bind-path test asserts the window binds as
`::timestamptz` text (no `Date` truncation), plus a two-run integration
test proving include-once / no-double-fetch / no-drop at adjacent
microseconds — the race is reproduced and proven gone.
- Fail-loud guard tests (unparseable, JS-parseable-but-wrong-format,
corrupt-token-on-empty-source); spy tests carry a positive-control
assertion so ESM interception can't pass vacuously.
- Full suite green (4503 tests); typecheck/build/prettier clean.

## Review
- Implementation + two unbiased 7-agent CR rounds → zero must-fix
findings; the precision machinery and guards are confirmed correct.

## Deferred — recommended deeper follow-up (NOT introduced here;
pre-existing)
Reviewers repeatedly (and correctly) flag that the high-water token read
and the content/removed reads are **not a single transactional
snapshot**, and the token is a **combined** seed+cache high-water mark.
Under concurrent writes or an exact same-microsecond cross-partition
tie, a row can be skipped for a cycle. This is **self-healed by the
nightly unconditional full reindex** and is not a regression from this
change (microsecond precision makes ties strictly rarer). Closing it
properly wants a transactional/repeatable-read snapshot around
token+content reads, or per-source tokens — a worthwhile dedicated
follow-up. Also pre-existing and out of scope here: repo-filtered
orphan-cache-page removal, and `clearAtlasCachePageStale({content:""})`
split-state.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant