[ENG-40451] Fix AzureAsyncStorageClient.readBlob truncating files to last chunk #186
Open
tiennguyen-onehouse wants to merge 3 commits into main from
Conversation
…last chunk

fileClient.read() returns Flux&lt;ByteBuffer&gt; — one buffer per HTTP chunk delivered by the Azure SDK Netty client (~8 KB each). The previous code called blockLast().array(), which silently discarded every chunk except the final one, leaving any file larger than a single download chunk uploaded to the lakeview mirror bucket truncated to just its tail. Replace with BinaryData.fromFlux(...).block() so the entire response stream is aggregated. Add a multi-chunk unit test that would have caught the bug.

Symptom traced from PagerDuty incident #65265: CEMetricsIngestionJobProcessingFailure — CP community-edition saw Jackson "Unrecognized token 'mp'" parsing an 8,279-byte deltacommit whose mirror copy in gs://onehouse-lakeview-production was only 87 bytes (exactly the tail chunk left after the first 8,192-byte buffer was dropped).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
dharmendersheshma approved these changes Apr 18, 2026
Simulates the actual truncation scenario: an 8192-byte first chunk followed by an 87-byte tail starting with "mp" (the token Jackson choked on). Verifies that the full content is aggregated, not just the last chunk.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>



Summary
fileClient.read() returns Flux&lt;ByteBuffer&gt; — one buffer per HTTP chunk delivered by the Azure SDK Netty client (~8 KB each). The previous blockLast().array() silently discarded every chunk except the final one, leaving any file larger than a single download chunk uploaded to the lakeview mirror bucket truncated to just its tail. Replaced with BinaryData.fromFlux(fileClient.read()).block() so the entire response stream is aggregated.

Incident context (ENG-40451)
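The difference between the two call patterns can be sketched with the plain JDK. This is a simulation only: the real code operates on a Reactor Flux&lt;ByteBuffer&gt; and aggregates via azure-core's BinaryData, which are stand-ins here for a plain List of buffers, and the method names are illustrative, not the PR's actual helpers.

```java
import java.nio.ByteBuffer;
import java.util.List;

public class ChunkAggregationSketch {

    // Buggy behavior: the equivalent of flux.blockLast().array() —
    // only the final ~8 KB chunk of the download survives.
    static byte[] lastChunkOnly(List<ByteBuffer> chunks) {
        ByteBuffer last = chunks.get(chunks.size() - 1).duplicate();
        byte[] out = new byte[last.remaining()];
        last.get(out);
        return out;
    }

    // Fixed behavior: the equivalent of BinaryData.fromFlux(flux).block().toBytes() —
    // every chunk is concatenated in order.
    static byte[] allChunks(List<ByteBuffer> chunks) {
        int total = chunks.stream().mapToInt(ByteBuffer::remaining).sum();
        ByteBuffer out = ByteBuffer.allocate(total);
        chunks.forEach(c -> out.put(c.duplicate()));
        return out.array();
    }

    public static void main(String[] args) {
        // The incident geometry: one full 8192-byte chunk plus an 87-byte tail.
        List<ByteBuffer> chunks = List.of(
                ByteBuffer.wrap(new byte[8192]),
                ByteBuffer.wrap(new byte[87]));
        System.out.println(lastChunkOnly(chunks).length); // 87   (truncated mirror copy)
        System.out.println(allChunks(chunks).length);     // 8279 (full deltacommit)
    }
}
```

Any single-chunk file (under ~8 KB) round-trips identically through both paths, which is why the bug only surfaced on larger files.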
PagerDuty #65265 — CEMetricsIngestionJobProcessingFailure, Sev1, for org 06bf5da5-ce19-42ba-857e-d0e0b4442718, table airflow_spark_jobs_db/af_scheduled_table_1. CP community-edition saw Jackson's "Unrecognized token 'mp'" parse failure.

Root cause trace:
- Source .hoodie/20260331220335439.deltacommit begins 7b 0a 20 20 22 70 → {\n  "p
- Mirror gs://onehouse-lakeview-production/.../20260331220335439.deltacommit begins 6d 70 5c 22 2c 5c → mp\",\
- 8279 − 87 = 8192 — exactly one Azure HTTP download chunk. The mirror got only the final blockLast() chunk; mp is where the middle of the JSON string "writeSchema": "{...model_line..." happened to split.

Test plan
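The hex prefixes in the trace can be checked mechanically. A small sketch (the decode helper is hypothetical, written only to make the byte arithmetic concrete):

```java
import java.nio.charset.StandardCharsets;

public class PrefixDecode {

    // Decodes a run of byte values as UTF-8 (all bytes in the trace are ASCII).
    static String decode(int... bytes) {
        byte[] raw = new byte[bytes.length];
        for (int i = 0; i < bytes.length; i++) raw[i] = (byte) bytes[i];
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Source deltacommit: a well-formed JSON opening.
        System.out.println(decode(0x7b, 0x0a, 0x20, 0x20, 0x22, 0x70)); // {\n  "p
        // Mirror copy: starts mid-token, which is what Jackson rejected.
        System.out.println(decode(0x6d, 0x70, 0x5c, 0x22, 0x2c, 0x5c)); // mp\",\
        // The missing prefix is exactly one download chunk.
        System.out.println(8279 - 87); // 8192
    }
}
```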
- ./gradlew :lakeview:test --tests ai.onehouse.storage.AzureAsyncStorageClientTest — all 14 pass, the new testReadBlobMultipleChunks in particular.
- Backport to release-v0.27.x and redeploy the customer-side extractor.
- Re-mirror the .hoodie/*.deltacommit files (> one chunk in size) in gs://onehouse-lakeview-production/ for previously-corrupted tables — at minimum 06bf5da5-…/420f16de-…/airflow_spark_jobs_db/af_scheduled_table_1/.

🤖 Generated with Claude Code