fix: tolerate docs corpus outages #1

Merged: steipete merged 3 commits into openclaw:main from vyctorbrzezowski:vh/molty-corpus-fetch-resilience on Jun 7, 2026

Conversation

@vyctorbrzezowski

@vyctorbrzezowski vyctorbrzezowski commented Jun 6, 2026

Contributor

Summary

  • Use the published docs-search.json as the primary docs retrieval index so each chat request no longer depends on downloading llms-full.txt.
  • Keep llms-full.txt and /.well-known/llms-full.txt as compatibility fallbacks when the search index is unavailable or cannot answer an otherwise empty workspace query.
  • Keep building a workspace with source/GitHub context plus a mounted docs-unavailable note when docs retrieval is down.
  • Add smoke coverage for the runtime retrieval paths: usable docs index, docs-index failure with corpus fallback, and full docs outage with source/GitHub still mounted.
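The retrieval order described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `loadDocsWithFallback`, `firstAvailable`, `FetchText`, and `DocsResult` are hypothetical names, the fetcher is injected so the sketch runs without a network, and the "empty workspace query" fallback condition is left out.

```typescript
// Hypothetical names throughout; the real helpers in src/retrieval.ts may differ.
type FetchText = (url: string) => Promise<string>;

interface DocsResult {
  source: "index" | "corpus" | "unavailable";
  body: string;
}

// Try each URL in order and return the first body that loads, or null if all fail.
async function firstAvailable(urls: string[], fetchText: FetchText): Promise<string | null> {
  for (const url of urls) {
    try {
      return await fetchText(url);
    } catch {
      // Ignore this failure and move on to the next candidate URL.
    }
  }
  return null;
}

async function loadDocsWithFallback(
  indexUrl: string,
  corpusUrls: string[],
  fetchText: FetchText,
): Promise<DocsResult> {
  // 1. Prefer the published search index.
  const index = await firstAvailable([indexUrl], fetchText);
  if (index !== null) return { source: "index", body: index };

  // 2. Fall back to the legacy corpus locations, in order.
  const corpus = await firstAvailable(corpusUrls, fetchText);
  if (corpus !== null) return { source: "corpus", body: corpus };

  // 3. Nothing loaded: return a note instead of throwing.
  return { source: "unavailable", body: "Docs retrieval is currently unavailable." };
}
```

Returning a tagged result rather than throwing lets the caller decide how to mount the docs-unavailable note while still building the rest of the workspace.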

Related UI boundary PR: openclaw/docs#36.

Validation

  • npm run typecheck
  • npm run lint
  • npm run format:check
  • ASK_MOLTY_DOCS_REPO=... ASK_MOLTY_SOURCE_REPO=... ASK_MOLTY_GITCRAWL_DB=dist/gitcrawl-fixture.db ASK_MOLTY_OUT_DIR=dist/test npm run export && npm run smoke
  • git diff --check

@clawsweeper

clawsweeper Bot commented Jun 6, 2026

Codex review: needs maintainer review before merge. Reviewed June 7, 2026, 1:55 AM ET / 05:55 UTC.

Summary
The PR changes Ask Molty retrieval to prefer docs-search.json, retry/fallback to llms-full.txt, preserve source/GitHub context during docs outages, add DOCS_INDEX_URL, and cover the runtime paths in smoke tests.
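For readers unfamiliar with the retry half of "retry/fallback", a bare-bones retry wrapper might look like the sketch below. `withRetry` is a hypothetical helper, not the PR's `loadDocsRecords`; it retries immediately with no backoff, and the attempt count is an assumption.

```typescript
// Hypothetical helper: run an async task up to `attempts` times, rethrowing the last error.
async function withRetry<T>(task: () => Promise<T>, attempts: number): Promise<T> {
  const tries = Math.max(1, attempts); // always try at least once
  let lastError: unknown;
  for (let i = 0; i < tries; i++) {
    try {
      return await task();
    } catch (error) {
      lastError = error; // remember the failure and try again
    }
  }
  throw lastError;
}
```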

Reproducibility: yes, from source inspection. Current main awaits the docs corpus fetch without a catch, so a 503 or invalid corpus response aborts buildWorkspace before usable source/GitHub context can be mounted. I did not run a local repro because this review is read-only, but the PR smoke test models the failing path directly.

Review metrics: 2 noteworthy metrics.

  • Changed surface: 5 files, +309/-13. The PR touches runtime retrieval, smoke coverage, env typing/config, and README docs for one bounded outage-handling change.
  • Runtime smoke paths: 3 scenarios added. The added smoke coverage exercises the docs index path, corpus fallback, and full docs outage behavior that the PR is meant to protect.

Merge readiness
Overall: 🦞 diamond lobster
Proof: 🦞 diamond lobster
Patch quality: 🦞 diamond lobster
Result: ready for maintainer review.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Next step before merge

  • No ClawSweeper repair lane is needed; the current head has targeted proof and no actionable code finding from this read-only review.

Security
Cleared: The diff changes Worker fetch fallback logic and configuration only; I found no concrete credential, auth, dependency, workflow, or supply-chain regression.

Review details

Best possible solution:

Merge the focused resiliency path after normal maintainer checks, keeping the docs search index primary with corpus compatibility fallback and a clear docs-unavailable workspace note.

Do we have a high-confidence way to reproduce the issue?

Yes, from source inspection: current main awaits the docs corpus fetch without a catch, so a 503 or invalid corpus response aborts buildWorkspace before usable source/GitHub context can be mounted. I did not run a local repro because this review is read-only, but the PR smoke test models the failing path directly.

Is this the best way to solve the issue?

Yes: preferring the smaller published docs index, falling back to the legacy corpus, and mounting an explicit docs-unavailable note is narrower than failing the whole chat request. The added smoke coverage targets the important runtime paths.

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 15fa3670959d.

Label changes

Label changes:

  • add proof: sufficient: Contributor real behavior proof is sufficient. A maintainer posted terminal proof for current head 9fc511d showing gates and live runtime smoke output for the docs index, corpus fallback, and outage paths.
  • add rating: 🦞 diamond lobster: Overall readiness is 🦞 diamond lobster; proof is 🦞 diamond lobster and patch quality is 🦞 diamond lobster.
  • add status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (terminal): A maintainer posted terminal proof for current head 9fc511d showing gates and live runtime smoke output for the docs index, corpus fallback, and outage paths.
  • remove rating: 🧂 unranked krab: Current PR rating is rating: 🦞 diamond lobster, so this older rating label is no longer current.
  • remove status: 📣 needs proof: Current PR status label is status: 👀 ready for maintainer look.

Label justifications:

  • P2: This is a normal-priority runtime resilience fix for Ask Molty retrieval with limited blast radius and focused coverage.
  • rating: 🦞 diamond lobster: Overall readiness is 🦞 diamond lobster; proof is 🦞 diamond lobster and patch quality is 🦞 diamond lobster.
  • status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (terminal): A maintainer posted terminal proof for current head 9fc511d showing gates and live runtime smoke output for the docs index, corpus fallback, and outage paths.
  • proof: sufficient: Contributor real behavior proof is sufficient. A maintainer posted terminal proof for current head 9fc511d showing gates and live runtime smoke output for the docs index, corpus fallback, and outage paths.

Evidence reviewed

What I checked:

  • Repository policy read: The full target AGENTS.md was read; its retrieval-change guidance led this review to check src/retrieval.ts, src/types.ts, scripts/smoke.ts, wrangler.toml, and README together. (AGENTS.md:48, 15fa3670959d)
  • Current main failure path: On current main, buildWorkspace awaits loadText(env.DOCS_CORPUS_URL ?? docsCorpusUrl, 1000) without a catch while source and GitHub index loads are caught, so a docs corpus fetch failure rejects the whole workspace build. (src/retrieval.ts:9, 15fa3670959d)
  • PR outage handling: The PR head wraps docs loading in loadDocsRecords(env).catch(...), keeps source/GitHub loads independent, and adds a docs-unavailable workspace file when docs records cannot be loaded. (src/retrieval.ts:12, 9fc511dc594c)
  • PR fallback implementation: The PR head adds docsSearchIndexUrl, retry-backed loadDocsRecords, loadDocsCorpus, and docsRecordsFromSearchIndex so the published docs search index is primary while corpus URLs remain compatibility fallbacks. (src/retrieval.ts:147, 9fc511dc594c)
  • Smoke coverage: The PR smoke script calls buildWorkspace with mocked network responses and checks docs index success, corpus fallback after index failure, and full docs outage with source/GitHub context still mounted. (scripts/smoke.ts:36, 9fc511dc594c)
  • Config surface: The PR adds DOCS_INDEX_URL to the Worker environment type and wrangler.toml, and updates README wording for the docs index/corpus/workspace URL defaults. (wrangler.toml:14, 9fc511dc594c)
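The difference between the main failure path and the PR's outage handling comes down to giving each load its own catch. A rough sketch of that shape, with a mocked-loader check in the spirit of the smoke script, is below; `buildWorkspaceSketch`, the `Loader` type, and the file paths are illustrative assumptions rather than the repository's actual code.

```typescript
// Hypothetical sketch: loader names and file paths are illustrative only.
type Loader = () => Promise<string[]>;

async function buildWorkspaceSketch(
  loadDocs: Loader,
  loadSource: Loader,
  loadGitHub: Loader,
): Promise<Map<string, string>> {
  // Each load has its own catch, so one failing source cannot reject the whole build.
  const [docs, source, github] = await Promise.all([
    loadDocs().catch(() => null), // null marks "docs unavailable"
    loadSource().catch(() => [] as string[]),
    loadGitHub().catch(() => [] as string[]),
  ]);

  const files = new Map<string, string>();
  if (docs === null) {
    // Mount an explicit note instead of failing the request.
    files.set("docs/UNAVAILABLE.md", "Docs retrieval is down; use source and GitHub context.");
  } else {
    docs.forEach((text, i) => files.set(`docs/${i}.md`, text));
  }
  source.forEach((text, i) => files.set(`source/${i}.txt`, text));
  github.forEach((text, i) => files.set(`github/${i}.md`, text));
  return files;
}
```

Calling it with a docs loader that rejects and two loaders that resolve should yield a workspace containing the unavailable note plus the source and GitHub files, which is the outage scenario the smoke coverage is described as checking.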

Likely related people:

  • steipete: Peter Steinberger authored the current retrieval implementation, smoke script history, canonical docs-host updates, and the latest two commits on this PR head. (role: recent area contributor; confidence: high; commits: 32f5154dde10, d0663995057c, 6e0f3d3292cb; files: src/retrieval.ts, scripts/smoke.ts, wrangler.toml)

What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@clawsweeper clawsweeper Bot added labels on Jun 6, 2026:
  • rating: 🧂 unranked krab: Not merge-ready due to missing proof or serious correctness/safety concerns.
  • status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask.
  • P2: Normal priority bug or improvement with limited blast radius.
@vyctorbrzezowski

Contributor Author

@clawsweeper re-review

@clawsweeper

clawsweeper Bot commented Jun 6, 2026

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

@steipete

steipete commented Jun 7, 2026

Contributor

Maintainer proof for current head 9fc511d.

Local gates run from /Users/steipete/Projects/ask-molty:

npm run typecheck
npm run lint
npm run format:check
ASK_MOLTY_DOCS_REPO=/Users/steipete/Projects/docs-openclaw \
ASK_MOLTY_SOURCE_REPO=/Users/steipete/Projects/clawdbot5 \
ASK_MOLTY_GITCRAWL_DB=/Users/steipete/.config/gitcrawl/stores/gitcrawl-store/data/openclaw__openclaw.sync.db \
ASK_MOLTY_OUT_DIR=dist/test npm run export && npm run smoke && git diff --check

Live export used the local OpenClaw docs/source/gitcrawl dataset:

docs records: 685
source records: 18455
github records: 32890
output: dist/test
19046 files, 227.64 MiB

Runtime retrieval smoke proof:

runtime retrieval ok: docs-search.json mounted without corpus fetch
runtime retrieval ok: docs corpus fallback mounted after index failure
runtime retrieval ok: docs outage keeps source and GitHub context
ask-molty smoke ok: 19228 workspace files

Autoreview rerun after the accepted fix:

autoreview clean: no accepted/actionable findings reported
overall: patch is correct (0.86)
tests exit: 0 after 150s

Also fixed one review edge while validating: relative docs-search.json entry URLs now resolve against the configured DOCS_INDEX_URL origin, so staging/test index overrides do not cite production docs accidentally.
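As a rough illustration of that edge, resolving entry URLs against the index origin could look like the sketch below. `resolveEntryUrl` and the example hosts are hypothetical; only the standard `URL` constructor is assumed.

```typescript
// Hypothetical helper: resolve a docs-search.json entry URL against the index's origin.
function resolveEntryUrl(entryUrl: string, indexUrl: string): string {
  const origin = new URL(indexUrl).origin; // e.g. "https://staging.example.com"
  // Relative entries resolve against the origin; absolute entries pass through unchanged.
  return new URL(entryUrl, origin).toString();
}

// A staging index keeps citations on the staging host.
const staged = resolveEntryUrl("/guide/install", "https://staging.example.com/docs-search.json");
console.log(staged); // https://staging.example.com/guide/install
```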

@clawsweeper re-review

@clawsweeper

clawsweeper Bot commented Jun 7, 2026

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

@clawsweeper clawsweeper Bot changed labels on Jun 7, 2026:
  • added proof: sufficient: Contributor real behavior proof is sufficient.
  • added rating: 🦞 diamond lobster: Very strong PR readiness with only minor maintainer review expected.
  • added status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR.
  • removed rating: 🧂 unranked krab: Not merge-ready due to missing proof or serious correctness/safety concerns.
  • removed status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask.
@steipete steipete merged commit 5d1f14e into openclaw:main Jun 7, 2026
Labels

  • P2: Normal priority bug or improvement with limited blast radius.
  • proof: sufficient: Contributor real behavior proof is sufficient.
  • rating: 🦞 diamond lobster: Very strong PR readiness with only minor maintainer review expected.
  • status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR.

2 participants