Fix models not appearing: stop /api/models from re-downloading the catalog every call by michaelneale · Pull Request #776 · Mesh-LLM/mesh-llm

michaelneale · 2026-06-02T01:52:49Z

What this fixes

Models stopped showing up in the console because /api/models was hanging ~9s on every call and silently serving a stale (or empty) catalog.

Each /api/models request walks ensure_catalog(), which saw the staleness marker as expired and kicked off a synchronous HuggingFace download of the meshllm/catalog dataset. That download failed on the very first small file with a spurious size mismatch:

failed to refresh stale meshllm/catalog; using already-loaded stale catalog:
downloaded 818 bytes but expected 341 bytes for entries/.../*.json

The refresh aborted before touching the .last_refresh marker, so ensure_catalog() fell back to the stale cache and returned Ok — but the marker stayed stale, so the next request re-attempted the same multi-second download. That loop repeated forever, ~9s per call.

This was pre-existing on main (verified by building and timing a clean main checkout), not a regression from #773.

Root cause + the two fixes

1. The actual bug (hf-hub fork). The cache download path HEADs the resolve endpoint with a no-redirect client and gets a 307 whose own Content-Length (341) is the redirect body, not the file. With no X-Linked-Size, the fork trusted that length, then rejected the correctly-downloaded 818-byte file. Fixed in Mesh-LLM/hf-hub#2 — extract_file_size() no longer trusts a redirect's Content-Length. This PR points hf-hub at that fix branch.

2. Defense-in-depth (this crate). Add a 5-minute refresh backoff so that any future refresh failure (HF outage, auth, etc.) can't turn every request into a fresh network download when a stale catalog is already loaded. After a failed refresh with a loaded catalog, retries are suppressed for 5 minutes; a successful refresh clears the backoff immediately.
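
The redirect handling from fix 1 can be sketched roughly as follows. This is a minimal illustration, not the fork's code: `HeadResponse` and the `header` helper are hypothetical stand-ins for the HTTP client's response type, and returning `None` for a redirect without `X-Linked-Size` is an assumption about how "no longer trusts" is realized.

```rust
/// Hypothetical stand-in for the parts of a HEAD response that matter here.
pub struct HeadResponse<'a> {
    pub status: u16,
    /// Header (name, value) pairs as received.
    pub headers: &'a [(&'a str, &'a str)],
}

/// Case-insensitive header lookup (hypothetical helper).
fn header<'a>(resp: &HeadResponse<'a>, name: &str) -> Option<&'a str> {
    resp.headers
        .iter()
        .find(|(k, _)| k.eq_ignore_ascii_case(name))
        .map(|(_, v)| *v)
}

/// Returns the size of the target file, if the response states it reliably.
pub fn extract_file_size(resp: &HeadResponse<'_>) -> Option<u64> {
    // X-Linked-Size describes the target file, so it is usable on any status.
    if let Some(size) = header(resp, "x-linked-size").and_then(|v| v.trim().parse::<u64>().ok()) {
        return Some(size);
    }
    // On a 3xx response, Content-Length is the size of the redirect body,
    // not of the file, so it must not be used for the size check.
    if (300..400).contains(&resp.status) {
        return None;
    }
    header(resp, "content-length").and_then(|v| v.trim().parse::<u64>().ok())
}
```

With the values from the log above, a 307 carrying only `Content-Length: 341` yields no expected size, so the 818-byte download is no longer rejected.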

Validation

cargo fmt --all -- --check and cargo clippy -p mesh-llm-host-runtime --all-targets -- -D warnings are both clean.
New unit test refresh_backoff_suppresses_then_clears passes.
New ignored networked test refresh_catalog_live now downloads the live meshllm/catalog dataset end-to-end (80 entries); without the hf-hub fix it fails on entry #1.

Follow-up

Cargo.toml temporarily tracks the fork's micn/fix-redirect-content-length-size branch. Once Mesh-LLM/hf-hub#2 merges into the fork's mesh-llm branch, repoint hf-hub back to branch = "mesh-llm" and cargo update -p hf-hub.

The catalog refresh failed on every request because the hf-hub fork mistook a 307 redirect's Content-Length for the target file size, tripping the post-download size check. ensure_catalog() then silently fell back to the stale cache without refreshing the staleness marker, so every /api/models call re-attempted the multi-second download forever (~9s per call).

- Point hf-hub at the redirect Content-Length fix branch (Mesh-LLM/hf-hub#2).
- Add a 5-minute refresh backoff so a failing refresh can't turn every request into a fresh network download when a stale catalog is already loaded.
- Add a unit test for the backoff and an ignored networked test that verifies the live meshllm/catalog dataset now downloads end-to-end.

Copilot

Pull request overview

Fixes /api/models re-downloading the HuggingFace catalog on every call by (a) pointing hf-hub at a fork branch that fixes a redirect Content-Length mis-detection, and (b) adding a 5-minute refresh backoff so any future refresh failure (with a stale catalog already loaded) doesn't trigger a network attempt per request.

Changes:

Add CATALOG_REFRESH_BACKOFF_UNTIL state + helpers; consult it in ensure_catalog() when stale and a cached catalog exists; clear on successful refresh.
Repoint hf-hub patch to the micn/fix-redirect-content-length-size fork branch.
Add unit test for backoff lifecycle and an ignored networked test for live catalog refresh.
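
The backoff lifecycle in the change list above (set on failure, consulted when stale with a cached catalog, cleared on success) might look roughly like this. It is a sketch under assumptions: `RefreshBackoff`, its methods, and `should_attempt_refresh` are hypothetical names, and the PR's actual `CATALOG_REFRESH_BACKOFF_UNTIL` state may be shaped differently. Only the 5-minute window comes from the description.

```rust
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// The 5-minute window named in the PR description.
pub const REFRESH_BACKOFF: Duration = Duration::from_secs(5 * 60);

/// Tracks the earliest instant at which another refresh may be attempted.
pub struct RefreshBackoff {
    until: Mutex<Option<Instant>>,
}

impl RefreshBackoff {
    pub const fn new() -> Self {
        Self { until: Mutex::new(None) }
    }

    /// Called after a refresh fails while a stale catalog is loaded.
    pub fn note_failure(&self, now: Instant) {
        *self.until.lock().unwrap() = Some(now + REFRESH_BACKOFF);
    }

    /// Called after a successful refresh: retries are allowed again at once.
    pub fn clear(&self) {
        *self.until.lock().unwrap() = None;
    }

    pub fn is_active(&self, now: Instant) -> bool {
        matches!(*self.until.lock().unwrap(), Some(until) if now < until)
    }
}

/// Decision a caller like ensure_catalog() would make: refresh only when the
/// catalog is stale, and skip the network while backing off with a loaded
/// catalog. With nothing loaded, a refresh is always attempted.
pub fn should_attempt_refresh(
    stale: bool,
    has_loaded_catalog: bool,
    backoff: &RefreshBackoff,
    now: Instant,
) -> bool {
    stale && !(has_loaded_catalog && backoff.is_active(now))
}
```

Passing `now` in explicitly keeps the logic testable without sleeping, which is one plausible way a unit test such as refresh_backoff_suppresses_then_clears could exercise the window.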

Reviewed changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| crates/mesh-llm-host-runtime/src/models/remote_catalog.rs | Implements refresh backoff with helpers, integrates into `ensure_catalog()`, and adds tests. |
| Cargo.toml | Temporarily repoints `hf-hub` patch to the fix branch (follow-up to revert noted in description). |
| Cargo.lock | Updated to pick up the new `hf-hub` commit (and incidental dependency churn). |

The hf-hub redirect fix makes the remote catalog actually downloadable in CI again. parse_exact_model_ref consults the live catalog before the Hugging Face parser branches, so the parse_exact_model_ref_accepts_* tests now match a real catalog entry (e.g. unsloth/gemma-4-31B-it-GGUF) and return ExactModelRef::Catalog instead of the HuggingFace ref they assert. These tests were only passing because catalog downloads were broken. Install an empty catalog override (and mark serial, since the override is global) so the parser branches are tested in isolation.
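
The ordering problem described above can be illustrated with a simplified model. Everything here is hypothetical apart from the names `parse_exact_model_ref` and `ExactModelRef::Catalog`: the variant payloads, the `HuggingFace` variant's shape, and the explicit `catalog` parameter are assumptions (the real parser consults a global, overridable catalog).

```rust
/// Simplified stand-in for the crate's `ExactModelRef`; payloads are hypothetical.
#[derive(Debug, PartialEq, Eq)]
pub enum ExactModelRef {
    Catalog(String),
    HuggingFace { repo: String },
}

/// Hypothetical signature: the catalog is passed in explicitly so the
/// precedence is easy to see.
pub fn parse_exact_model_ref(input: &str, catalog: &[&str]) -> Option<ExactModelRef> {
    // The catalog lookup runs first, so any matching entry wins.
    if catalog.iter().any(|entry| *entry == input) {
        return Some(ExactModelRef::Catalog(input.to_string()));
    }
    // Only then is the input treated as a Hugging Face style `org/repo` ref.
    let (org, name) = input.split_once('/')?;
    if org.is_empty() || name.is_empty() || name.contains('/') {
        return None;
    }
    Some(ExactModelRef::HuggingFace { repo: input.to_string() })
}
```

With an empty catalog the same input falls through to the Hugging Face branch, which is the isolation the empty override gives the parse_exact_model_ref_accepts_* tests.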

Copilot

Pull request overview

Copilot reviewed 2 out of 3 changed files in this pull request and generated no new comments.

Copilot AI review requested due to automatic review settings June 2, 2026 01:52

Copilot started reviewing on behalf of michaelneale June 2, 2026 01:52

Copilot AI reviewed Jun 2, 2026

michaelneale added 2 commits June 2, 2026 12:08

deps: track merged hf-hub redirect size fix on fork mesh-llm branch

f619eff

Copilot AI review requested due to automatic review settings June 2, 2026 03:05

Copilot started reviewing on behalf of michaelneale June 2, 2026 03:06

Copilot AI reviewed Jun 2, 2026

michaelneale merged commit c221017 into main Jun 2, 2026
36 checks passed

michaelneale deleted the micn/model-load-last-fix branch June 2, 2026 03:31

michaelneale merged 3 commits into main from micn/model-load-last-fix

2 participants
