Fix intermittent wrong-MIME / truncated responses under load (transient node-error resilience)#351

Open: damip wants to merge 4 commits into main from fix/deweb-transient-node-error-resilience



@damip damip commented May 30, 2026

Problem

The DeWeb provider intermittently served websites incorrectly, most visibly on gossip.massa.network, where the WASM bundle would sometimes come back with the wrong MIME type or as a truncated/empty binary, causing the app to show "Failed to start. Please restart the app." The behavior was intermittent: loads sometimes succeeded, and failures tended to arrive in bursts.

Root cause

The gossip site sets Cache-Control: no-store, no-cache on-chain, so browsers re-fetch every resource on every visit (dozens of highly concurrent requests). On top of that, the provider performs a LAST_UPDATE timestamp lookup on every single request. These bursts overload the public Massa node (mainnet.massa.net), which then intermittently answers HTTP 503 with an HTML body that fails JSON decoding.

The provider was not resilient to those transient node errors, leading to three failure modes:

  1. Nil-pointer panic → truncated/empty binary. In RequestFile, when GetLastUpdateTimestamp failed, lastUpdated was nil but the code still did cache.Save(..., *lastUpdated, ...). The nil dereference panicked the HTTP handler, resetting the connection mid-response → the browser received a truncated/empty .wasm.
  2. Broken-page HTML → wrong MIME type. A failed timestamp call caused the cache to be bypassed and a fresh chain fetch forced. If that fetch also hit a transient error, the embedded brokenWebsite HTML page (Content-Type: text/html) was served in place of the asset. The browser got HTML where it expected application/wasm, so WebAssembly.instantiateStreaming rejected it. (The same class of issue could serve index.html for an asset path, which is why opening a .wasm URL directly appeared to "redirect" to the base URL — the SPA booted from that HTML and routed to /.)
  3. Whole-site failures on transient resolution errors. resolveResourceName and MNS domain resolution aborted on any transient node error, serving the broken/domain-not-found page even when a valid copy of the asset was already cached.

Reproduction

Running the provider locally against mainnet and replaying a realistic page-load burst (warm cache, then concurrent re-fetches of all resources) reproduced it reliably: 5,494 / 6,400 requests failed, including handler panics, whenever the node started returning 503s.

Fix

Make serving resilient to transient node errors, and stop hammering the node:

  • No nil deref / serve stale on error (pkg/webmanager/manager.go): never dereference a nil lastUpdated; when the timestamp call fails, serve the cached copy instead of bypassing the cache, and store a zero time so the entry is refreshed as soon as the node recovers.
  • Tolerant resource resolution (int/api/middlewares.go): on a transient existence-check error, resolveResourceName returns the originally requested resource as-is (so it can still be served from cache) instead of silently substituting index.html with the wrong MIME type.
  • MNS last-known-good fallback (pkg/mns/cache/cache.go + resolveAddress): added a non-expiring fallback resolution, used only when a fresh MNS resolution fails, so a previously resolved site keeps serving during a node hiccup instead of returning "domain not found".
  • Short-TTL last-update timestamp cache (pkg/website/read.go, int/api/config/cache.go): cache the LAST_UPDATE timestamp for a short, configurable period (last_update_cache_duration_seconds, default 10s; set to 0 to disable). This collapses the per-request node call (the main source of overload) into roughly one call per website per TTL, while keeping on-chain website updates visible within the TTL. Errors are never cached, so a recovering node is picked up immediately.

This is a deliberate, bounded freshness-vs-load tradeoff: an on-chain update becomes visible within at most last_update_cache_duration_seconds. It is consistent with the existing 60s file-list cache and is tunable per deployment.

Results

Same heavy load test, after the fix:

  • 0 / 6,400 anomalies, 0 panics, even though the node still returned thousands of transient errors during the run — they are now absorbed by the cache / fallback layers.
  • With the timestamp cache, the page-load burst no longer reaches the node at all: the load test duration dropped from ~63s to ~4.6s and produced 0 node 503s.

Build tooling / CI

While preparing this PR, CI started failing for a reason unrelated to the runtime fix above, so it is fixed here as well:

  • CI regenerates the Swagger API code (server/api/read/...) via task generate, and the generator was installed with swagger@latest. Once go-swagger v0.34.0 became latest, it began emitting imports for the new split go-openapi/swag sub-packages (swag/jsonutils, swag/typeutils, swag/netutils, swag/cmdutils), which the pinned single-package go-openapi/swag did not provide — breaking the build, test, and lint jobs simultaneously (the lint job's golangci-lint exit code 3 was a typecheck failure, not a style issue).
  • Fixed by moving forward cleanly rather than freezing on the old generator:
    • Pin the generator to swagger v0.34.0 in server/Taskfile.yml for reproducible generation.
    • Regenerate api/read against the split swag packages.
    • Add the swag sub-packages to go.mod, pinned to v0.25.5 — the latest line that targets Go 1.24, avoiding swag v0.26.0 which requires Go 1.25 and would in turn force a golangci-lint v2 migration.
    • Bump the module and the CI setup-go version 1.23 → 1.24 to match.

Tests

  • New unit tests for the timestamp cache: hit, expiry, configured TTL, and disabled (TTL = 0).
  • New end-to-end test (with a fake JSON-RPC node) proving the timestamp is cached within the TTL (no extra node calls, update not yet visible) and refetched after the TTL (the on-chain update IS picked up).
  • Full suite passes with -race; go build, go test ./..., gofumpt, and golangci-lint v1.64 all pass locally against the regenerated tree.

Test plan

  • cd server && task test (or go test -race ./pkg/... ./int/...)
  • Run a provider instance and load gossip.massa.network; confirm the WASM is served as application/wasm and the app starts reliably under repeated reloads.
  • Update a website on-chain and confirm the change is served within last_update_cache_duration_seconds.

Made with Cursor

damip and others added 4 commits May 30, 2026 13:24
Under load (the gossip site sets Cache-Control: no-store, so browsers
re-fetch every resource on each visit), the public Massa node intermittently
returns 503s. This caused the provider to serve wrong/truncated responses:

- RequestFile dereferenced a nil lastUpdated pointer when the last-update
  timestamp call failed, panicking the handler and resetting the connection
  (a truncated/empty binary in the browser).
- A failed timestamp call bypassed the cache and forced a fresh fetch; if that
  also failed, the broken-website HTML page was served in place of the asset
  (wrong MIME type, e.g. text/html instead of application/wasm, breaking
  WebAssembly.instantiateStreaming).
- resolveResourceName and MNS resolution aborted on transient node errors,
  serving the broken/domain-not-found page even when the asset was cached.

Fixes:
- Never dereference a nil lastUpdated; serve cached content when the timestamp
  call fails, and store a zero time so the entry refreshes once the node recovers.
- resolveResourceName falls back to the requested resource as-is on a transient
  existence-check error instead of silently serving index.html.
- Add a non-expiring last-known-good MNS fallback used only when fresh
  resolution fails.
- Cache the last-update timestamp for a short, configurable TTL
  (last_update_cache_duration_seconds, default 10s) to collapse the per-request
  node call while keeping on-chain updates visible within the TTL.

Adds unit and end-to-end tests for the timestamp cache.

Co-authored-by: Cursor <cursoragent@cursor.com>
CI regenerates the Swagger API code via `task generate`, which used
swagger@latest. swagger >= v0.34.0 emits imports for the new split
go-openapi/swag sub-packages (swag/jsonutils, swag/typeutils, swag/netutils,
swag/cmdutils) that are not provided by the go-openapi/swag version pinned in
go.mod, breaking the build/test/lint jobs. Pin the generator to v0.33.0, the
last release that emits the single-package swag import compatible with the
pinned dependencies.

Also restructure the cache-decision branch in RequestFile back to if/else and
add the blank lines required by the wsl linter in the new test.

Co-authored-by: Cursor <cursoragent@cursor.com>
CI regenerates the Swagger API code via `task generate`. The generator was
installed with swagger@latest, so once go-swagger v0.34.0 became latest it
emitted imports for the new split go-openapi/swag sub-packages
(swag/jsonutils, swag/typeutils, swag/netutils, swag/cmdutils) that the
pinned go-openapi/swag did not provide, breaking build/test/lint.

Rather than freezing the generator on the last single-package release, move
forward cleanly:
- Pin the generator to swagger v0.34.0 for reproducible generation.
- Regenerate the api/read code against the split swag packages.
- Add the swag sub-packages to go.mod, pinned to v0.25.5 (the latest line
  that targets Go 1.24 rather than v0.26.0 which requires Go 1.25 and would
  in turn force a golangci-lint v2 upgrade).
- Bump the module and CI Go version 1.23 -> 1.24 to match.

Co-authored-by: Cursor <cursoragent@cursor.com>
The plugin module consumes the server's generated Swagger code via the
`replace github.com/massalabs/deweb-server => ../server` directive. After the
server adopted the split go-openapi/swag sub-packages, the plugin's go.mod was
out of date ("go: updates to go.mod needed"), failing `task generate`. Tidy
the plugin module so it requires the same swag sub-packages (v0.25.5). The
plugin already targets Go 1.24, so no Go version bump is needed here.

Co-authored-by: Cursor <cursoragent@cursor.com>