fix(ingest): size nested LossyList contexts from sample-size env vars#17590
Merged
treff7es merged 2 commits intoJun 2, 2026
Conversation
Contributor
|
Linear: ING-2760 Thanks for your contribution! We have created an internal ticket to track this PR. A member of the core DataHub team will be assigned to review it within the next few business days - you will get a follow-up comment once a reviewer is assigned. |
Bundle ReportBundle size has no change ✅ |
treff7es
approved these changes
Jun 2, 2026
The DATAHUB_REPORT_*_SAMPLE_SIZE env vars only sized the outer dict of distinct error/warning entries. The per-entry context LossyList inside each StructuredLogEntry was always constructed with the default max_elements=10, so when many items were grouped under one error key (e.g. "Error processing table" with N affected tables) the context list silently truncated to 10 with a "... sampled of N total elements" note. Fix: - Add LossyList.resize() parallel to LossyDict.resize. - In StructuredLogs.report_log, size per-entry context lists from the level's configured LossyDict.max_elements. - In StructuredLogs.set_sample_sizes, also resize the context list on any entries already recorded (sources can log during __init__, before the pipeline applies the configured sample size). - Audit and fix the same nested-LossyList default in seven other reports: unity profile_table_errors, datahub database_parse_errors, gc sample_(soft|hard)_deleted_aspects_by_type, informatica sample_paths_by_type, classification info_types_detected, sql_parsing queries_with_temp_upstreams, and assertion compiler per-key warnings/failures. All changes are no-ops at default settings (env vars default to "10", matching the prior LossyList default), so existing reports are unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…mments - LossyList.resize(): raise ValueError for max_elements < 1 to prevent silent data loss (resize(0) made append() a permanent no-op) and the ValueError crash from random.sample when max_elements is negative. Cache current_len to avoid calling super().__len__() three times. Extend docstring to cover total_elements/len() semantics and sampled flag behaviour after grow. - LossyDict.resize(): same guard for consistency. - source.py report_log(): remove internal ticket reference from comment. - test_report_sample_size.py: remove ticket reference from docstring; add test_set_sample_sizes_shrinks_existing_context to cover the prune path (previously only the grow path was tested). - test_lossy_collections.py: add subset assertion to shrink test so retained elements are verified to come from the original set; add test_lossylist_resize_rejects_invalid_max_elements.
cbed12e to
4f1fc64
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
context: LossyList[str]insideStructuredLogEntrywas constructed with the defaultmax_elements=10, soDATAHUB_REPORT_FAILURE_SAMPLE_SIZE/_WARNING_/_INFO_only affected the number of distinct entries, not the number of grouped contexts under each entry. When many items got grouped under one error key (e.g."Error processing table"with many failing tables), the customer-visible context was capped at 10 with a"... sampled of N total elements"note even when the env var was set to 100,000.StructuredLogs.report_log(size the per-entry context list from the level's configuredmax_elements) and inset_sample_sizes(also resize context lists on already-recorded entries, so__init__-time logs are not stuck at the default).LossyList.resize()parallel to the existingLossyDict.resize.LossyListdefaults found by audit: unityprofile_table_errors, datahubdatabase_parse_errors, gcsample_(soft|hard)_deleted_aspects_by_type, informaticasample_paths_by_type, classificationinfo_types_detected, sql_parsingqueries_with_temp_upstreams, assertion compiler per-key warnings/failures."10", matching the priorLossyList()default, so existing reports are unchanged.Builds on the earlier sample-size work in #16165 and #17027.
Test plan
LossyList.resize(grow / shrink / no-op).DATAHUB_REPORT_FAILURE_SAMPLE_SIZEfor entries created after the env var is set.set_sample_sizes()also resizes the nested context on entries created before it ran (covers source__init__-time logs)../gradlew :metadata-ingestion:lintFixand:lintclean (ruff, format check, mypy)../gradlew :metadata-ingestion:testQuick— 12,493 pass. The 23 failures are all pre-existing on clean master (timezone drift, env-specific pyodbc, dataproduct golden files) and none touch the modified files.🤖 Generated with Claude Code