
[ENG-40451] Fix AzureAsyncStorageClient.readBlob truncating files to last chunk#186

Open
tiennguyen-onehouse wants to merge 3 commits into main from eng-40451-fix-azure-readblob-truncation

Conversation

@tiennguyen-onehouse

Summary

  • fileClient.read() returns Flux<ByteBuffer> — one buffer per HTTP chunk delivered by the Azure SDK Netty client (~8 KB each). The previous blockLast().array() silently discarded every chunk except the final one, so any file larger than a single download chunk was uploaded to the lakeview mirror bucket truncated to just its tail.
  • Replace with BinaryData.fromFlux(fileClient.read()).block() so the entire response stream is aggregated.
  • Add a multi-chunk unit test that exercises the Flux-aggregation path.
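
The behavioral difference can be sketched without the Azure SDK: modeling the response as an ordered list of ByteBuffer chunks, blockLast() keeps only the final element, while the fix concatenates the whole stream. A stdlib-only sketch — class and method names here are illustrative, not the real client code:

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.List;

public class ChunkAggregationDemo {

    // Buggy behavior: like blockLast(), keep only the final chunk of the stream.
    static byte[] lastChunkOnly(List<ByteBuffer> chunks) {
        ByteBuffer last = chunks.get(chunks.size() - 1).duplicate();
        byte[] out = new byte[last.remaining()];
        last.get(out);
        return out;
    }

    // Fixed behavior: aggregate every chunk in order, like BinaryData.fromFlux(...).block().
    static byte[] aggregateAll(List<ByteBuffer> chunks) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        for (ByteBuffer chunk : chunks) {
            ByteBuffer copy = chunk.duplicate();
            byte[] tmp = new byte[copy.remaining()];
            copy.get(tmp);
            bos.write(tmp, 0, tmp.length);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) {
        byte[] first = new byte[8192];   // simulated first HTTP download chunk
        Arrays.fill(first, (byte) 'x');
        byte[] tail = new byte[87];      // simulated final chunk (the incident's 87 bytes)
        List<ByteBuffer> chunks = List.of(ByteBuffer.wrap(first), ByteBuffer.wrap(tail));

        System.out.println(lastChunkOnly(chunks).length); // prints 87   (truncated mirror file)
        System.out.println(aggregateAll(chunks).length);  // prints 8279 (full file)
    }
}
```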

Incident context (ENG-40451)

PagerDuty incident #65265 (CEMetricsIngestionJobProcessingFailure), Sev1, for org 06bf5da5-ce19-42ba-857e-d0e0b4442718, table airflow_spark_jobs_db/af_scheduled_table_1.

CP community-edition saw:

HoodieIOException: Unable to read metadata for instant [20260331220335439__deltacommit__COMPLETED__...]
Caused by: JsonParseException: Unrecognized token 'mp' at line 1, column 1

Root cause trace:

| source | file | size | first bytes | note |
| --- | --- | --- | --- | --- |
| Customer Azure | `.hoodie/20260331220335439.deltacommit` | 8 279 bytes | `7b 0a 20 20 22 70` (`{\n  "p`) | valid JSON |
| Mirror | `gs://onehouse-lakeview-production/.../20260331220335439.deltacommit` | 87 bytes | `6d 70 5c 22 2c 5c` (`mp\",\`) | exact tail of original |

8279 − 87 = 8192 — exactly one Azure HTTP download chunk. The mirror received only the final blockLast() chunk; "mp" is simply where the long JSON string "writeSchema": "{...model_line..." happened to be split mid-value at the 8192-byte chunk boundary.
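
The arithmetic above can be sanity-checked directly. This tiny sketch assumes a fixed 8192-byte download chunk size (the ~8 KB Azure SDK Netty chunk observed in the incident):

```java
public class ChunkMath {

    // Size of the final chunk when a file is streamed in fixed-size chunks.
    static int tailChunkSize(int fileSize, int chunkSize) {
        int remainder = fileSize % chunkSize;
        return remainder == 0 ? chunkSize : remainder;
    }

    public static void main(String[] args) {
        // 8279-byte deltacommit streamed in 8192-byte Azure download chunks:
        System.out.println(tailChunkSize(8279, 8192)); // prints 87
    }
}
```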

Test plan

  • ./gradlew :lakeview:test --tests ai.onehouse.storage.AzureAsyncStorageClientTest — all 14 pass, new testReadBlobMultipleChunks in particular.
  • Cherry-pick to release-v0.27.x and redeploy the customer-side extractor.
  • Re-upload affected .hoodie/*.deltacommit files (> one chunk in size) in gs://onehouse-lakeview-production/ for previously-corrupted tables — at minimum 06bf5da5-…/420f16de-…/airflow_spark_jobs_db/af_scheduled_table_1/.
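
The scenario the new unit test exercises can also be sketched outside the test harness. This is a hypothetical stdlib-only reconstruction — the real testReadBlobMultipleChunks mocks the Azure file client rather than using plain lists:

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class TruncationScenarioDemo {

    // Concatenate all chunks in order (the fixed aggregation path).
    static byte[] concat(List<ByteBuffer> chunks) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        for (ByteBuffer c : chunks) {
            ByteBuffer copy = c.duplicate();
            byte[] tmp = new byte[copy.remaining()];
            copy.get(tmp);
            bos.write(tmp, 0, tmp.length);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) {
        // First 8192 bytes of the deltacommit: a valid JSON prefix '{\n  "p...'.
        byte[] first = new byte[8192];
        byte[] prefix = "{\n  \"p".getBytes(StandardCharsets.UTF_8);
        System.arraycopy(prefix, 0, first, 0, prefix.length);

        // Final 87-byte chunk: begins mid-string with the bytes 'mp\",\'.
        byte[] tail = new byte[87];
        byte[] tailStart = "mp\\\",\\".getBytes(StandardCharsets.UTF_8);
        System.arraycopy(tailStart, 0, tail, 0, tailStart.length);

        // Last-chunk-only output starts with "mp" — Jackson's "Unrecognized token 'mp'".
        System.out.println(new String(tail, StandardCharsets.UTF_8).substring(0, 2)); // prints mp

        // Full aggregation starts with '{' — parseable JSON again.
        byte[] full = concat(List.of(ByteBuffer.wrap(first), ByteBuffer.wrap(tail)));
        System.out.println((char) full[0]); // prints {
    }
}
```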

🤖 Generated with Claude Code

Fix AzureAsyncStorageClient.readBlob truncating files to last chunk

fileClient.read() returns Flux<ByteBuffer> — one buffer per HTTP chunk
delivered by the Azure SDK Netty client (~8 KB each). The previous code
called blockLast().array(), which silently discarded every chunk except
the final one, leaving any file larger than a single download chunk
uploaded to the lakeview mirror bucket truncated to just its tail.

Replace with BinaryData.fromFlux(...).block() so the entire response
stream is aggregated. Add a multi-chunk unit test that would have
caught the bug.

Symptom traced from PagerDuty incident #65265: CEMetricsIngestionJobProcessingFailure
— CP community-edition saw Jackson "Unrecognized token 'mp'" parsing an
8 279-byte deltacommit whose mirror copy in
gs://onehouse-lakeview-production was only 87 bytes (exactly the tail
chunk after the first 8 192-byte buffer was dropped).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@tiennguyen-onehouse tiennguyen-onehouse marked this pull request as ready for review April 17, 2026 22:35
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Simulate the actual truncation scenario: 8192-byte first chunk followed
by an 87-byte tail starting with "mp" (the token Jackson choked on).
Verifies the full content is aggregated, not just the last chunk.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
