fix: tolerate docs corpus outages by vyctorbrzezowski · Pull Request #1 · openclaw/ask-molty

vyctorbrzezowski · 2026-06-06T18:34:25Z

Summary

Use the published docs-search.json as the primary docs retrieval index so each chat request no longer depends on downloading llms-full.txt.
Keep llms-full.txt and /.well-known/llms-full.txt as compatibility fallbacks when the search index is unavailable or cannot answer an otherwise empty workspace query.
Keep building a workspace with source/GitHub context plus a mounted docs-unavailable note when docs retrieval is down.
Add smoke coverage for the runtime retrieval paths: usable docs index, docs-index failure with corpus fallback, and full docs outage with source/GitHub still mounted.
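
The retrieval order in the summary above can be sketched roughly as follows. This is a minimal illustration and not the PR's code: `loadFirstAvailable`, `DocsSource`, `DocsRecord`, and the `docs-unavailable.md` path are hypothetical names, and the loaders are injected so the sketch runs without network access.

```typescript
// Hypothetical sketch of "index first, corpus fallbacks, note on full outage".
// None of these names come from the PR; they are for illustration only.
type DocsRecord = { path: string; text: string };
type DocsSource = { name: string; load: () => Promise<DocsRecord[]> };

async function loadFirstAvailable(sources: DocsSource[]): Promise<DocsRecord[]> {
  for (const source of sources) {
    try {
      const records = await source.load();
      if (records.length > 0) return records; // first usable source wins
    } catch {
      // Ignore the failure and try the next source in priority order.
    }
  }
  // Every source failed or was empty: return a note instead of rejecting.
  return [{ path: "docs-unavailable.md", text: "Docs retrieval is currently unavailable." }];
}
```

Passing the sources in priority order (search index, then the two corpus URLs) keeps the fallback policy in one place, which is one way to make each path easy to exercise in a smoke test.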

Related UI boundary PR: openclaw/docs#36.

Validation

npm run typecheck
npm run lint
npm run format:check
ASK_MOLTY_DOCS_REPO=... ASK_MOLTY_SOURCE_REPO=... ASK_MOLTY_GITCRAWL_DB=dist/gitcrawl-fixture.db ASK_MOLTY_OUT_DIR=dist/test npm run export && npm run smoke
git diff --check

clawsweeper · 2026-06-06T18:35:36Z

Codex review: needs maintainer review before merge. Reviewed June 7, 2026, 1:55 AM ET / 05:55 UTC.

Summary
The PR changes Ask Molty retrieval to prefer docs-search.json, retry/fallback to llms-full.txt, preserve source/GitHub context during docs outages, add DOCS_INDEX_URL, and cover the runtime paths in smoke tests.

Reproducibility: yes, from source inspection. Current main awaits the docs corpus fetch without a catch, so a 503 or invalid corpus response aborts buildWorkspace before usable source/GitHub context can be mounted. I did not run a local repro because this review is read-only, but the PR smoke test models the failing path directly.

Review metrics: 2 noteworthy metrics.

Changed surface: 5 files, +309/-13. The PR touches runtime retrieval, smoke coverage, env typing/config, and README docs for one bounded outage-handling change.
Runtime smoke paths: 3 scenarios added. The added smoke coverage exercises the docs index path, corpus fallback, and full docs outage behavior that the PR is meant to protect.

Merge readiness
Overall: 🦞 diamond lobster
Proof: 🦞 diamond lobster
Patch quality: 🦞 diamond lobster
Result: ready for maintainer review.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.
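
The "weaker of the two" rule can be illustrated with a short sketch. The ordering below is an assumption taken from the order of the rank legend later in this comment, and `overallRank` is a hypothetical helper, not ClawSweeper's implementation.

```typescript
// Assumed ordering, weakest to strongest, based on the rank legend in this review.
// "off-meta tidepool" is omitted because that rating does not apply to ranked items.
const ranks = [
  "unranked krab",
  "silver shellfish",
  "gold shrimp",
  "platinum hermit",
  "diamond lobster",
  "challenger crab",
] as const;
type Rank = (typeof ranks)[number];

// Hypothetical helper: the overall rank is whichever input sits lower in the ordering.
function overallRank(proof: Rank, patchQuality: Rank): Rank {
  return ranks.indexOf(proof) <= ranks.indexOf(patchQuality) ? proof : patchQuality;
}
```

Under this sketch, a diamond lobster patch with gold shrimp proof would be reported as gold shrimp overall.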

Next step before merge

No ClawSweeper repair lane is needed; the current head has targeted proof and no actionable code finding from this read-only review.

Security
Cleared: The diff changes Worker fetch fallback logic and configuration only; I found no concrete credential, auth, dependency, workflow, or supply-chain regression.

Review details

Best possible solution:

Merge the focused resiliency path after normal maintainer checks, keeping the docs search index primary with corpus compatibility fallback and a clear docs-unavailable workspace note.

Do we have a high-confidence way to reproduce the issue?

Yes, from source inspection: current main awaits the docs corpus fetch without a catch, so a 503 or invalid corpus response aborts buildWorkspace before usable source/GitHub context can be mounted. I did not run a local repro because this review is read-only, but the PR smoke test models the failing path directly.

Is this the best way to solve the issue?

Yes: preferring the smaller published docs index, falling back to the legacy corpus, and mounting an explicit docs-unavailable note is narrower than failing the whole chat request. The added smoke coverage targets the important runtime paths.

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 15fa3670959d.

Label changes

Label changes:

add proof: sufficient: Contributor real behavior proof is sufficient. A maintainer posted terminal proof for current head 9fc511d showing gates and live runtime smoke output for the docs index, corpus fallback, and outage paths.
add rating: 🦞 diamond lobster: Overall readiness is 🦞 diamond lobster; proof is 🦞 diamond lobster and patch quality is 🦞 diamond lobster.
add status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (terminal): A maintainer posted terminal proof for current head 9fc511d showing gates and live runtime smoke output for the docs index, corpus fallback, and outage paths.
remove rating: 🧂 unranked krab: Current PR rating is rating: 🦞 diamond lobster, so this older rating label is no longer current.
remove status: 📣 needs proof: Current PR status label is status: 👀 ready for maintainer look.

Label justifications:

P2: This is a normal-priority runtime resilience fix for Ask Molty retrieval with limited blast radius and focused coverage.
rating: 🦞 diamond lobster: Overall readiness is 🦞 diamond lobster; proof is 🦞 diamond lobster and patch quality is 🦞 diamond lobster.
status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (terminal): A maintainer posted terminal proof for current head 9fc511d showing gates and live runtime smoke output for the docs index, corpus fallback, and outage paths.
proof: sufficient: Contributor real behavior proof is sufficient. A maintainer posted terminal proof for current head 9fc511d showing gates and live runtime smoke output for the docs index, corpus fallback, and outage paths.

Evidence reviewed

What I checked:

Repository policy read: The full target AGENTS.md was read; its retrieval-change guidance led this review to check src/retrieval.ts, src/types.ts, scripts/smoke.ts, wrangler.toml, and README together. (AGENTS.md:48, 15fa3670959d)
Current main failure path: On current main, buildWorkspace awaits loadText(env.DOCS_CORPUS_URL ?? docsCorpusUrl, 1000) without a catch while source and GitHub index loads are caught, so a docs corpus fetch failure rejects the whole workspace build. (src/retrieval.ts:9, 15fa3670959d)
PR outage handling: The PR head wraps docs loading in loadDocsRecords(env).catch(...), keeps source/GitHub loads independent, and adds a docs-unavailable workspace file when docs records cannot be loaded. (src/retrieval.ts:12, 9fc511dc594c)
PR fallback implementation: The PR head adds docsSearchIndexUrl, retry-backed loadDocsRecords, loadDocsCorpus, and docsRecordsFromSearchIndex so the published docs search index is primary while corpus URLs remain compatibility fallbacks. (src/retrieval.ts:147, 9fc511dc594c)
Smoke coverage: The PR smoke script calls buildWorkspace with mocked network responses and checks docs index success, corpus fallback after index failure, and full docs outage with source/GitHub context still mounted. (scripts/smoke.ts:36, 9fc511dc594c)
Config surface: The PR adds DOCS_INDEX_URL to the Worker environment type and wrangler.toml, and updates README wording for the docs index/corpus/workspace URL defaults. (wrangler.toml:14, 9fc511dc594c)
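
The difference between the main failure path and the PR outage handling described above can be sketched as follows. This is a simplified illustration under stated assumptions: `buildWorkspaceSketch`, the `Files` shape, and the `docs/UNAVAILABLE.md` path are hypothetical and do not reproduce src/retrieval.ts.

```typescript
// Hypothetical sketch: each load is caught independently so that a docs outage
// cannot reject the whole workspace build. Names are illustrative only.
type Files = Record<string, string>;

async function buildWorkspaceSketch(loaders: {
  docs: () => Promise<Files>;
  source: () => Promise<Files>;
  github: () => Promise<Files>;
}): Promise<Files> {
  const [docs, source, github] = await Promise.all([
    // On docs failure, substitute a note file rather than rethrowing.
    loaders.docs().catch((): Files => ({ "docs/UNAVAILABLE.md": "Docs could not be loaded." })),
    loaders.source().catch((): Files => ({})),
    loaders.github().catch((): Files => ({})),
  ]);
  // Merge whatever loaded; source and GitHub context survive a docs outage.
  return { ...docs, ...source, ...github };
}
```

Removing the `.catch` on the docs loader reproduces the behavior attributed to current main: a single rejected promise makes `Promise.all` reject and no workspace is returned.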

Likely related people:

steipete: Peter Steinberger authored the current retrieval implementation, smoke script history, canonical docs-host updates, and the latest two commits on this PR head. (role: recent area contributor; confidence: high; commits: 32f5154dde10, d0663995057c, 6e0f3d3292cb; files: src/retrieval.ts, scripts/smoke.ts, wrangler.toml)

What the crustacean ranks mean

🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works

ClawSweeper keeps one durable marker-backed review comment per issue or PR.
Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
Maintainers can also comment @clawsweeper review to request a fresh review only.
Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

vyctorbrzezowski · 2026-06-06T19:14:37Z

@clawsweeper re-review

clawsweeper · 2026-06-06T19:14:40Z

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

State: Complete
Detail: The targeted re-review finished, the durable review comment was updated, and the synced verdict was routed.
Run: https://github.com/openclaw/clawsweeper/actions/runs/27071419617
Updated: 2026-06-06T19:20:45.602Z

steipete · 2026-06-07T05:49:56Z

Maintainer proof for current head 9fc511d.

Local gates run from /Users/steipete/Projects/ask-molty:

npm run typecheck
npm run lint
npm run format:check
ASK_MOLTY_DOCS_REPO=/Users/steipete/Projects/docs-openclaw \
ASK_MOLTY_SOURCE_REPO=/Users/steipete/Projects/clawdbot5 \
ASK_MOLTY_GITCRAWL_DB=/Users/steipete/.config/gitcrawl/stores/gitcrawl-store/data/openclaw__openclaw.sync.db \
ASK_MOLTY_OUT_DIR=dist/test npm run export && npm run smoke && git diff --check

Live export used the local OpenClaw docs/source/gitcrawl dataset:

docs records: 685
source records: 18455
github records: 32890
output: dist/test
19046 files, 227.64 MiB

Runtime retrieval smoke proof:

runtime retrieval ok: docs-search.json mounted without corpus fetch
runtime retrieval ok: docs corpus fallback mounted after index failure
runtime retrieval ok: docs outage keeps source and GitHub context
ask-molty smoke ok: 19228 workspace files

Autoreview rerun after the accepted fix:

autoreview clean: no accepted/actionable findings reported
overall: patch is correct (0.86)
tests exit: 0 after 150s

Also fixed one review edge while validating: relative docs-search.json entry URLs now resolve against the configured DOCS_INDEX_URL origin, so staging/test index overrides do not cite production docs accidentally.
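
A minimal sketch of that resolution step, assuming the standard WHATWG `URL` constructor is used. `resolveEntryUrl` and the example hosts are hypothetical and not taken from the commit.

```typescript
// Hypothetical helper: resolve a docs-search.json entry URL against the origin
// of the configured index URL (assumed to correspond to DOCS_INDEX_URL).
function resolveEntryUrl(entryUrl: string, indexUrl: string): string {
  const base = new URL(indexUrl).origin;
  // The URL constructor leaves absolute inputs unchanged and resolves
  // relative ones against the base.
  return new URL(entryUrl, base).toString();
}

// Usage with made-up hosts: a staging index yields staging links.
const resolved = resolveEntryUrl("/guide/install", "https://staging.example.com/docs-search.json");
```

With this approach an entry that already carries an absolute URL is returned as-is, so only relative entries follow the override.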

@clawsweeper re-review

clawsweeper · 2026-06-07T05:49:59Z

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

State: Complete
Detail: The targeted re-review finished, the durable review comment was updated, and the synced verdict was routed.
Run: https://github.com/openclaw/clawsweeper/actions/runs/27084214573
Updated: 2026-06-07T05:55:53.217Z

fix: tolerate docs corpus outages

72d00bf

vyctorbrzezowski force-pushed the vh/molty-corpus-fetch-resilience branch from bd1055f to 72d00bf on June 6, 2026 19:12

vyctorbrzezowski mentioned this pull request Jun 6, 2026

fix(docs): soften transient Molty failures openclaw/docs#36

Merged

steipete added 2 commits June 6, 2026 22:42

test: print retrieval fallback proof

6e0f3d3

fix: resolve docs index links from override origin

9fc511d

steipete merged commit 5d1f14e into openclaw:main Jun 7, 2026

steipete merged 3 commits into openclaw:main from vyctorbrzezowski:vh/molty-corpus-fetch-resilience
