
[ENG-40451] Fix AzureAsyncStorageClient.readBlob truncating files to last chunk#186

Open
tiennguyen-onehouse wants to merge 3 commits into main from eng-40451-fix-azure-readblob-truncation

Conversation

@tiennguyen-onehouse

Summary

  • fileClient.read() returns Flux<ByteBuffer> — one buffer per HTTP chunk delivered by the Azure SDK Netty client (~8 KB each). The previous blockLast().array() silently discarded every chunk except the final one, so any file larger than a single download chunk was uploaded to the lakeview mirror bucket truncated to just its tail.
  • Replace with BinaryData.fromFlux(fileClient.read()).block() so the entire response stream is aggregated.
  • Add a multi-chunk unit test that exercises the Flux-aggregation path.
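
The behavioral difference can be sketched without the Azure SDK: modeling the response as an ordered list of ByteBuffer chunks, blockLast() keeps only the final element, while the fix concatenates the whole stream. A stdlib-only sketch — class and method names here are illustrative, not the real client code:

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.List;

public class ChunkAggregationDemo {

    // Buggy behavior: like blockLast(), keep only the final chunk of the stream.
    static byte[] lastChunkOnly(List<ByteBuffer> chunks) {
        ByteBuffer last = chunks.get(chunks.size() - 1).duplicate();
        byte[] out = new byte[last.remaining()];
        last.get(out);
        return out;
    }

    // Fixed behavior: aggregate every chunk in order, like BinaryData.fromFlux(...).block().
    static byte[] aggregateAll(List<ByteBuffer> chunks) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        for (ByteBuffer chunk : chunks) {
            ByteBuffer copy = chunk.duplicate();
            byte[] tmp = new byte[copy.remaining()];
            copy.get(tmp);
            bos.write(tmp, 0, tmp.length);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) {
        byte[] first = new byte[8192];   // simulated first HTTP download chunk
        Arrays.fill(first, (byte) 'x');
        byte[] tail = new byte[87];      // simulated final chunk (the incident's 87 bytes)
        List<ByteBuffer> chunks = List.of(ByteBuffer.wrap(first), ByteBuffer.wrap(tail));

        System.out.println(lastChunkOnly(chunks).length); // prints 87   (truncated mirror file)
        System.out.println(aggregateAll(chunks).length);  // prints 8279 (full file)
    }
}
```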

Incident context (ENG-40451)

PagerDuty incident #65265 (CEMetricsIngestionJobProcessingFailure), Sev1, for org 06bf5da5-ce19-42ba-857e-d0e0b4442718, table airflow_spark_jobs_db/af_scheduled_table_1.

CP community-edition saw:

HoodieIOException: Unable to read metadata for instant [20260331220335439__deltacommit__COMPLETED__...]
Caused by: JsonParseException: Unrecognized token 'mp' at line 1, column 1

Root cause trace:

| source | file | size | first bytes | note |
| --- | --- | --- | --- | --- |
| Customer Azure | `.hoodie/20260331220335439.deltacommit` | 8 279 bytes | `7b 0a 20 20 22 70` (`{\n  "p`) | valid JSON |
| Mirror | `gs://onehouse-lakeview-production/.../20260331220335439.deltacommit` | 87 bytes | `6d 70 5c 22 2c 5c` (`mp\",\`) | exact tail of original |

8279 − 87 = 8192 — exactly one Azure HTTP download chunk. The mirror received only the final blockLast() chunk; "mp" is simply where the long JSON string "writeSchema": "{...model_line..." happened to be split mid-value at the 8192-byte chunk boundary.
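
The arithmetic above can be sanity-checked directly. This tiny sketch assumes a fixed 8192-byte download chunk size (the ~8 KB Azure SDK Netty chunk observed in the incident):

```java
public class ChunkMath {

    // Size of the final chunk when a file is streamed in fixed-size chunks.
    static int tailChunkSize(int fileSize, int chunkSize) {
        int remainder = fileSize % chunkSize;
        return remainder == 0 ? chunkSize : remainder;
    }

    public static void main(String[] args) {
        // 8279-byte deltacommit streamed in 8192-byte Azure download chunks:
        System.out.println(tailChunkSize(8279, 8192)); // prints 87
    }
}
```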

Test plan

  • ./gradlew :lakeview:test --tests ai.onehouse.storage.AzureAsyncStorageClientTest — all 14 pass, new testReadBlobMultipleChunks in particular.
  • Cherry-pick to release-v0.27.x and redeploy the customer-side extractor.
  • Re-upload affected .hoodie/*.deltacommit files (> one chunk in size) in gs://onehouse-lakeview-production/ for previously-corrupted tables — at minimum 06bf5da5-…/420f16de-…/airflow_spark_jobs_db/af_scheduled_table_1/.
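
The scenario the new unit test exercises can also be sketched outside the test harness. This is a hypothetical stdlib-only reconstruction — the real testReadBlobMultipleChunks mocks the Azure file client rather than using plain lists:

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class TruncationScenarioDemo {

    // Concatenate all chunks in order (the fixed aggregation path).
    static byte[] concat(List<ByteBuffer> chunks) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        for (ByteBuffer c : chunks) {
            ByteBuffer copy = c.duplicate();
            byte[] tmp = new byte[copy.remaining()];
            copy.get(tmp);
            bos.write(tmp, 0, tmp.length);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) {
        // First 8192 bytes of the deltacommit: a valid JSON prefix '{\n  "p...'.
        byte[] first = new byte[8192];
        byte[] prefix = "{\n  \"p".getBytes(StandardCharsets.UTF_8);
        System.arraycopy(prefix, 0, first, 0, prefix.length);

        // Final 87-byte chunk: begins mid-string with the bytes 'mp\",\'.
        byte[] tail = new byte[87];
        byte[] tailStart = "mp\\\",\\".getBytes(StandardCharsets.UTF_8);
        System.arraycopy(tailStart, 0, tail, 0, tailStart.length);

        // Last-chunk-only output starts with "mp" — Jackson's "Unrecognized token 'mp'".
        System.out.println(new String(tail, StandardCharsets.UTF_8).substring(0, 2)); // prints mp

        // Full aggregation starts with '{' — parseable JSON again.
        byte[] full = concat(List.of(ByteBuffer.wrap(first), ByteBuffer.wrap(tail)));
        System.out.println((char) full[0]); // prints {
    }
}
```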

🤖 Generated with Claude Code

Fix AzureAsyncStorageClient.readBlob truncating files to last chunk

fileClient.read() returns Flux<ByteBuffer> — one buffer per HTTP chunk
delivered by the Azure SDK Netty client (~8 KB each). The previous code
called blockLast().array(), which silently discarded every chunk except
the final one, leaving any file larger than a single download chunk
uploaded to the lakeview mirror bucket truncated to just its tail.

Replace with BinaryData.fromFlux(...).block() so the entire response
stream is aggregated. Add a multi-chunk unit test that would have
caught the bug.

Symptom traced from PagerDuty incident #65265: CEMetricsIngestionJobProcessingFailure
— CP community-edition saw Jackson "Unrecognized token 'mp'" parsing an
8 279-byte deltacommit whose mirror copy in
gs://onehouse-lakeview-production was only 87 bytes (exactly the tail
chunk after the first 8 192-byte buffer was dropped).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@tiennguyen-onehouse tiennguyen-onehouse marked this pull request as ready for review April 17, 2026 22:35
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Simulate the actual truncation scenario: 8192-byte first chunk followed
by an 87-byte tail starting with "mp" (the token Jackson choked on).
Verifies the full content is aggregated, not just the last chunk.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
