fix(ingest): size nested LossyList contexts from sample-size env vars by JohnRTurner · Pull Request #17590 · datahub-project/datahub

JohnRTurner · 2026-05-26T18:52:11Z

Summary

Per-entry context: LossyList[str] inside StructuredLogEntry was constructed with the default max_elements=10, so DATAHUB_REPORT_FAILURE_SAMPLE_SIZE / _WARNING_ / _INFO_ only affected the number of distinct entries, not the number of grouped contexts under each entry. When many items got grouped under one error key (e.g. "Error processing table" with many failing tables), the customer-visible context was capped at 10 with a "... sampled of N total elements" note even when the env var was set to 100,000.
Fix the root cause in StructuredLogs.report_log (size the per-entry context list from the level's configured max_elements) and in set_sample_sizes (also resize context lists on already-recorded entries, so __init__-time logs are not stuck at the default).
Add LossyList.resize() parallel to the existing LossyDict.resize.
Apply the same fix to seven other nested-LossyList defaults found by audit: unity profile_table_errors, datahub database_parse_errors, gc sample_(soft|hard)_deleted_aspects_by_type, informatica sample_paths_by_type, classification info_types_detected, sql_parsing queries_with_temp_upstreams, assertion compiler per-key warnings/failures.
Backward-compatible at defaults: env vars default to "10", matching the prior LossyList() default, so existing reports are unchanged.

Builds on the earlier sample-size work in #16165 and #17027.

Test plan

Added unit tests for LossyList.resize (grow / shrink / no-op).
Added regression tests for the bug:
- Per-entry context grows with DATAHUB_REPORT_FAILURE_SAMPLE_SIZE for entries created after the env var is set.
- set_sample_sizes() also resizes the nested context on entries created before it ran (covers source __init__-time logs).
./gradlew :metadata-ingestion:lintFix and :lint clean (ruff, format check, mypy).
./gradlew :metadata-ingestion:testQuick — 12,493 pass. The 23 failures are all pre-existing on clean master (timezone drift, env-specific pyodbc, dataproduct golden files) and none touch the modified files.

🤖 Generated with Claude Code

github-actions · 2026-05-26T18:52:45Z

Linear: ING-2760

Thanks for your contribution! We have created an internal ticket to track this PR. A member of the core DataHub team will be assigned to review it within the next few business days - you will get a follow-up comment once a reviewer is assigned.

codecov · 2026-05-26T19:00:55Z

Bundle Report

Bundle size has no change ✅

The DATAHUB_REPORT_*_SAMPLE_SIZE env vars only sized the outer dict of distinct error/warning entries. The per-entry context LossyList inside each StructuredLogEntry was always constructed with the default max_elements=10, so when many items were grouped under one error key (e.g. "Error processing table" with N affected tables) the context list silently truncated to 10 with a "... sampled of N total elements" note. Fix: - Add LossyList.resize() parallel to LossyDict.resize. - In StructuredLogs.report_log, size per-entry context lists from the level's configured LossyDict.max_elements. - In StructuredLogs.set_sample_sizes, also resize the context list on any entries already recorded (sources can log during __init__, before the pipeline applies the configured sample size). - Audit and fix the same nested-LossyList default in seven other reports: unity profile_table_errors, datahub database_parse_errors, gc sample_(soft|hard)_deleted_aspects_by_type, informatica sample_paths_by_type, classification info_types_detected, sql_parsing queries_with_temp_upstreams, and assertion compiler per-key warnings/failures. All changes are no-ops at default settings (env vars default to "10", matching the prior LossyList default), so existing reports are unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…mments - LossyList.resize(): raise ValueError for max_elements < 1 to prevent silent data loss (resize(0) made append() a permanent no-op) and the ValueError crash from random.sample when max_elements is negative. Cache current_len to avoid calling super().__len__() three times. Extend docstring to cover total_elements/len() semantics and sampled flag behaviour after grow. - LossyDict.resize(): same guard for consistency. - source.py report_log(): remove internal ticket reference from comment. - test_report_sample_size.py: remove ticket reference from docstring; add test_set_sample_sizes_shrinks_existing_context to cover the prune path (previously only the grow path was tested). - test_lossy_collections.py: add subset assertion to shrink test so retained elements are verified to come from the original set; add test_lossylist_resize_rejects_invalid_max_elements.

github-actions Bot added ingestion PR or Issue related to the ingestion of metadata community-contribution PR or Issue raised by member(s) of DataHub Community labels May 26, 2026

vercel Bot deployed to Preview May 26, 2026 19:07 View deployment

maggiehays added the needs-review Label for PRs that need review from a maintainer. label May 26, 2026

treff7es mentioned this pull request Jun 2, 2026

fix(ingestion): respect sample size env var for per-entry context LossyList #17671

Closed

treff7es approved these changes Jun 2, 2026

View reviewed changes

JohnRTurner and others added 2 commits June 2, 2026 08:17

treff7es force-pushed the fix/ingestion-report-nested-lossy-list-truncation branch from cbed12e to 4f1fc64 Compare June 2, 2026 06:17

treff7es enabled auto-merge (squash) June 2, 2026 06:17

vercel Bot deployed to Preview June 2, 2026 06:32 View deployment

treff7es merged commit e0302ed into datahub-project:master Jun 2, 2026
72 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(ingest): size nested LossyList contexts from sample-size env vars#17590

fix(ingest): size nested LossyList contexts from sample-size env vars#17590
treff7es merged 2 commits into
datahub-project:masterfrom
JohnRTurner:fix/ingestion-report-nested-lossy-list-truncation

JohnRTurner commented May 26, 2026

Uh oh!

github-actions Bot commented May 26, 2026

Uh oh!

codecov Bot commented May 26, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

JohnRTurner commented May 26, 2026

Summary

Test plan

Uh oh!

github-actions Bot commented May 26, 2026

Uh oh!

codecov Bot commented May 26, 2026

Bundle Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants