Skip to content

fix(ingest): size nested LossyList contexts from sample-size env vars#17590

Merged
treff7es merged 2 commits into
datahub-project:masterfrom
JohnRTurner:fix/ingestion-report-nested-lossy-list-truncation
Jun 2, 2026
Merged

fix(ingest): size nested LossyList contexts from sample-size env vars#17590
treff7es merged 2 commits into
datahub-project:masterfrom
JohnRTurner:fix/ingestion-report-nested-lossy-list-truncation

Conversation

@JohnRTurner
Copy link
Copy Markdown
Contributor

Summary

  • Per-entry context: LossyList[str] inside StructuredLogEntry was constructed with the default max_elements=10, so DATAHUB_REPORT_FAILURE_SAMPLE_SIZE / _WARNING_ / _INFO_ only affected the number of distinct entries, not the number of grouped contexts under each entry. When many items got grouped under one error key (e.g. "Error processing table" with many failing tables), the customer-visible context was capped at 10 with a "... sampled of N total elements" note even when the env var was set to 100,000.
  • Fix the root cause in StructuredLogs.report_log (size the per-entry context list from the level's configured max_elements) and in set_sample_sizes (also resize context lists on already-recorded entries, so __init__-time logs are not stuck at the default).
  • Add LossyList.resize() parallel to the existing LossyDict.resize.
  • Apply the same fix to seven other nested-LossyList defaults found by audit: unity profile_table_errors, datahub database_parse_errors, gc sample_(soft|hard)_deleted_aspects_by_type, informatica sample_paths_by_type, classification info_types_detected, sql_parsing queries_with_temp_upstreams, assertion compiler per-key warnings/failures.
  • Backward-compatible at defaults: env vars default to "10", matching the prior LossyList() default, so existing reports are unchanged.

Builds on the earlier sample-size work in #16165 and #17027.

Test plan

  • Added unit tests for LossyList.resize (grow / shrink / no-op).
  • Added regression tests for the bug:
    • Per-entry context grows with DATAHUB_REPORT_FAILURE_SAMPLE_SIZE for entries created after the env var is set.
    • set_sample_sizes() also resizes the nested context on entries created before it ran (covers source __init__-time logs).
  • ./gradlew :metadata-ingestion:lintFix and :lint clean (ruff, format check, mypy).
  • ./gradlew :metadata-ingestion:testQuick — 12,493 pass. The 23 failures are all pre-existing on clean master (timezone drift, env-specific pyodbc, dataproduct golden files) and none touch the modified files.

🤖 Generated with Claude Code

@github-actions
Copy link
Copy Markdown
Contributor

Linear: ING-2760

Thanks for your contribution! We have created an internal ticket to track this PR. A member of the core DataHub team will be assigned to review it within the next few business days - you will get a follow-up comment once a reviewer is assigned.

@github-actions github-actions Bot added ingestion PR or Issue related to the ingestion of metadata community-contribution PR or Issue raised by member(s) of DataHub Community labels May 26, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented May 26, 2026

Bundle Report

Bundle size has no change ✅

JohnRTurner and others added 2 commits June 2, 2026 08:17
The DATAHUB_REPORT_*_SAMPLE_SIZE env vars only sized the outer dict of
distinct error/warning entries. The per-entry context LossyList inside
each StructuredLogEntry was always constructed with the default
max_elements=10, so when many items were grouped under one error key
(e.g. "Error processing table" with N affected tables) the context list
silently truncated to 10 with a "... sampled of N total elements" note.

Fix:

- Add LossyList.resize() parallel to LossyDict.resize.
- In StructuredLogs.report_log, size per-entry context lists from the
  level's configured LossyDict.max_elements.
- In StructuredLogs.set_sample_sizes, also resize the context list on
  any entries already recorded (sources can log during __init__, before
  the pipeline applies the configured sample size).
- Audit and fix the same nested-LossyList default in seven other
  reports: unity profile_table_errors, datahub database_parse_errors,
  gc sample_(soft|hard)_deleted_aspects_by_type, informatica
  sample_paths_by_type, classification info_types_detected,
  sql_parsing queries_with_temp_upstreams, and assertion compiler
  per-key warnings/failures.

All changes are no-ops at default settings (env vars default to "10",
matching the prior LossyList default), so existing reports are
unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…mments

- LossyList.resize(): raise ValueError for max_elements < 1 to prevent
  silent data loss (resize(0) made append() a permanent no-op) and the
  ValueError crash from random.sample when max_elements is negative.
  Cache current_len to avoid calling super().__len__() three times.
  Extend docstring to cover total_elements/len() semantics and sampled
  flag behaviour after grow.
- LossyDict.resize(): same guard for consistency.
- source.py report_log(): remove internal ticket reference from comment.
- test_report_sample_size.py: remove ticket reference from docstring;
  add test_set_sample_sizes_shrinks_existing_context to cover the prune
  path (previously only the grow path was tested).
- test_lossy_collections.py: add subset assertion to shrink test so
  retained elements are verified to come from the original set; add
  test_lossylist_resize_rejects_invalid_max_elements.
@treff7es treff7es force-pushed the fix/ingestion-report-nested-lossy-list-truncation branch from cbed12e to 4f1fc64 Compare June 2, 2026 06:17
@treff7es treff7es enabled auto-merge (squash) June 2, 2026 06:17
@treff7es treff7es merged commit e0302ed into datahub-project:master Jun 2, 2026
72 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-contribution PR or Issue raised by member(s) of DataHub Community ingestion PR or Issue related to the ingestion of metadata needs-review Label for PRs that need review from a maintainer.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants