Fix intermittent wrong-MIME / truncated responses under load (transient node-error resilience)#351
Open
damip wants to merge 4 commits into
Under load (the gossip site sets Cache-Control: no-store, so browsers re-fetch every resource on each visit), the public Massa node intermittently returns 503s. This caused the provider to serve wrong/truncated responses:
- RequestFile dereferenced a nil lastUpdated pointer when the last-update timestamp call failed, panicking the handler and resetting the connection (a truncated/empty binary in the browser).
- A failed timestamp call bypassed the cache and forced a fresh fetch; if that also failed, the broken-website HTML page was served in place of the asset (wrong MIME type, e.g. text/html instead of application/wasm, breaking WebAssembly.instantiateStreaming).
- resolveResourceName and MNS resolution aborted on transient node errors, serving the broken/domain-not-found page even when the asset was cached.

Fixes:
- Never dereference a nil lastUpdated; serve cached content when the timestamp call fails, and store a zero time so the entry refreshes once the node recovers.
- resolveResourceName falls back to the requested resource as-is on a transient existence-check error instead of silently serving index.html.
- Add a non-expiring last-known-good MNS fallback used only when fresh resolution fails.
- Cache the last-update timestamp for a short, configurable TTL (last_update_cache_duration_seconds, default 10s) to collapse the per-request node call while keeping on-chain updates visible within the TTL.

Adds unit and end-to-end tests for the timestamp cache.

Co-authored-by: Cursor <cursoragent@cursor.com>
CI regenerates the Swagger API code via `task generate`, which used swagger@latest. swagger >= v0.34.0 emits imports for the new split go-openapi/swag sub-packages (swag/jsonutils, swag/typeutils, swag/netutils, swag/cmdutils) that are not provided by the go-openapi/swag version pinned in go.mod, breaking the build/test/lint jobs.

Pin the generator to v0.33.0, the last release that emits the single-package swag import compatible with the pinned dependencies.

Also restructure the cache-decision branch in RequestFile back to if/else and add the blank lines required by the wsl linter in the new test.

Co-authored-by: Cursor <cursoragent@cursor.com>
CI regenerates the Swagger API code via `task generate`. The generator was installed with swagger@latest, so once go-swagger v0.34.0 became latest it emitted imports for the new split go-openapi/swag sub-packages (swag/jsonutils, swag/typeutils, swag/netutils, swag/cmdutils) that the pinned go-openapi/swag did not provide, breaking build/test/lint.

Rather than freezing the generator on the last single-package release, move forward cleanly:
- Pin the generator to swagger v0.34.0 for reproducible generation.
- Regenerate the api/read code against the split swag packages.
- Add the swag sub-packages to go.mod, pinned to v0.25.5 (the latest line that targets Go 1.24 rather than v0.26.0 which requires Go 1.25 and would in turn force a golangci-lint v2 upgrade).
- Bump the module and CI Go version 1.23 -> 1.24 to match.

Co-authored-by: Cursor <cursoragent@cursor.com>
The plugin module consumes the server's generated Swagger code via the
`replace github.com/massalabs/deweb-server => ../server` directive. After the
server adopted the split go-openapi/swag sub-packages, the plugin's go.mod was
out of date ("go: updates to go.mod needed"), failing `task generate`. Tidy
the plugin module so it requires the same swag sub-packages (v0.25.5). The
plugin already targets Go 1.24, so no Go version bump is needed here.
Co-authored-by: Cursor <cursoragent@cursor.com>
Problem
The DeWeb provider intermittently served websites incorrectly, most visibly on gossip.massa.network, where the WASM bundle would sometimes come back with the wrong MIME type or as a truncated/empty binary, causing the app to show "Failed to start. Please restart the app." It worked at some times and failed at others, with the failures clustering in bursts.
Root cause
The gossip site sets `Cache-Control: no-store, no-cache` on-chain, so browsers re-fetch every resource on every visit (dozens of requests, highly concurrent). On top of that, the provider performs a `LAST_UPDATE` timestamp lookup on every single request. This hits the public Massa node (`mainnet.massa.net`) in bursts, and the node then intermittently answers HTTP 503 (with an HTML body that fails JSON decoding).

The provider was not resilient to those transient node errors, leading to three failure modes:
- In `RequestFile`, when `GetLastUpdateTimestamp` failed, `lastUpdated` was `nil` but the code still did `cache.Save(..., *lastUpdated, ...)`. The nil dereference panicked the HTTP handler, resetting the connection mid-response, so the browser received a truncated/empty `.wasm`.
- A failed timestamp call bypassed the cache and forced a fresh fetch; if that also failed, the `brokenWebsite` HTML page (`Content-Type: text/html`) was served in place of the asset. The browser got HTML where it expected `application/wasm`, so `WebAssembly.instantiateStreaming` rejected it. (The same class of issue could serve `index.html` for an asset path, which is why opening a `.wasm` URL directly appeared to "redirect" to the base URL: the SPA booted from that HTML and routed to `/`.)
- `resolveResourceName` and MNS domain resolution aborted on any transient node error, serving the broken/domain-not-found page even when a valid copy of the asset was already cached.
Reproduction
Running the provider locally against mainnet and replaying a realistic page-load burst (warm cache, then concurrent re-fetches of all resources) reproduced it reliably: 5,494 / 6,400 requests failed, including handler panics, whenever the node started returning 503s.
Fix
Make serving resilient to transient node errors, and stop hammering the node:
- `pkg/webmanager/manager.go`: never dereference a nil `lastUpdated`; when the timestamp call fails, serve the cached copy instead of bypassing the cache, and store a zero time so the entry is refreshed as soon as the node recovers.
- `int/api/middlewares.go`: on a transient existence-check error, `resolveResourceName` returns the originally requested resource as-is (so it can still be served from cache) instead of silently substituting `index.html` with the wrong MIME type.
- `pkg/mns/cache/cache.go` + `resolveAddress`: added a non-expiring fallback resolution, used only when a fresh MNS resolution fails, so a previously resolved site keeps serving during a node hiccup instead of returning "domain not found".
- `pkg/website/read.go`, `int/api/config/cache.go`: cache the `LAST_UPDATE` timestamp for a short, configurable period (`last_update_cache_duration_seconds`, default 10s; set to `0` to disable). This collapses the per-request node call (the main source of overload) into roughly one call per website per TTL, while keeping on-chain website updates visible within the TTL. Errors are never cached, so a recovering node is picked up immediately.

This is a deliberate, bounded freshness-vs-load tradeoff: an on-chain update becomes visible within at most `last_update_cache_duration_seconds`. It is consistent with the existing 60s file-list cache and is tunable per deployment.
Results
Same heavy load test, after the fix:
Build tooling / CI
While preparing this PR, CI started failing for a reason unrelated to the runtime fix above, so it is fixed here as well:
- CI regenerates the Swagger API code (`server/api/read/...`) via `task generate`, and the generator was installed with `swagger@latest`. Once go-swagger v0.34.0 became latest, it began emitting imports for the new split `go-openapi/swag` sub-packages (`swag/jsonutils`, `swag/typeutils`, `swag/netutils`, `swag/cmdutils`), which the pinned single-package `go-openapi/swag` did not provide. This broke the build, test, and lint jobs simultaneously (the lint job's `golangci-lint exit code 3` was a typecheck failure, not a style issue).
- Pin the generator to swagger v0.34.0 in `server/Taskfile.yml` for reproducible generation.
- Regenerate `api/read` against the split `swag` packages.
- Add the `swag` sub-packages to `go.mod`, pinned to v0.25.5, the latest line that targets Go 1.24. This avoids `swag v0.26.0`, which requires Go 1.25 and would in turn force a golangci-lint v2 migration.
- Bump the module and CI `setup-go` version 1.23 → 1.24 to match.
Tests
`-race`; `go build`, `go test ./...`, `gofumpt`, and `golangci-lint v1.64` all pass locally against the regenerated tree.
Test plan
- `cd server && task test` (or `go test -race ./pkg/... ./int/...`)
- `gossip.massa.network`; confirm the WASM is served as `application/wasm` and the app starts reliably under repeated reloads.
- `last_update_cache_duration_seconds`.

Made with Cursor
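The MIME check from the test plan can be scripted rather than verified by hand in the browser. The following is a minimal sketch: `checkContentType` and `newDemoServer` are hypothetical helpers, and the `httptest` server merely imitates one correctly served asset and one HTML fallback. Against a real deployment the URL would be the provider's address for the `.wasm` asset.

```go
package main

import (
	"fmt"
	"mime"
	"net/http"
	"net/http/httptest"
)

// checkContentType fetches url and returns an error unless the response is
// 200 OK with the wanted media type. Parameters such as charset are ignored.
func checkContentType(client *http.Client, url, want string) error {
	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("%s: status %d", url, resp.StatusCode)
	}
	got, _, err := mime.ParseMediaType(resp.Header.Get("Content-Type"))
	if err != nil {
		return fmt.Errorf("%s: bad Content-Type: %w", url, err)
	}
	if got != want {
		return fmt.Errorf("%s: Content-Type %q, want %q", url, got, want)
	}
	return nil
}

// newDemoServer serves /app.wasm with the correct type and an HTML page for
// every other path, imitating the wrong-MIME failure mode.
func newDemoServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path == "/app.wasm" {
			w.Header().Set("Content-Type", "application/wasm")
			w.Write([]byte("\x00asm"))
			return
		}
		w.Header().Set("Content-Type", "text/html; charset=utf-8")
		w.Write([]byte("<html></html>"))
	}))
}

func main() {
	srv := newDemoServer()
	defer srv.Close()

	for _, path := range []string{"/app.wasm", "/other.wasm"} {
		if err := checkContentType(srv.Client(), srv.URL+path, "application/wasm"); err != nil {
			fmt.Println("FAIL:", err)
			continue
		}
		fmt.Println("ok:", path)
	}
}
```

Running such a check in a loop during a reload burst would catch the intermittent case that a single manual reload can miss.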