
Fallback to parquet footers when metadata table col stats are missing #818

Open
vinishjail97 wants to merge 3 commits into main from fallback-to-parquet-footers-for-missing-metadata-col-stats


@vinishjail97 vinishjail97 commented Apr 1, 2026

Summary

  • In computeColumnStatsFromMetadataTable, files absent from the metadata table's column stats index were silently returned with empty columnStats and recordCount=0, causing downstream consumers to treat valid files as empty
  • After fetching stats from the metadata table, files with empty columnStats are now detected and re-processed via computeColumnStatsFromParquetFooters
  • A WARN log is emitted when the fallback is triggered, providing visibility into metadata table gaps
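A minimal sketch of the fallback flow described above, not the code from this PR. `DataFile`, `metadataIndex`, and `footerReader` are hypothetical stand-ins for `InternalDataFile`, the metadata table's column stats lookup, and `computeColumnStatsFromParquetFooters`; `java.util.logging` is used only to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.logging.Logger;

public class FooterFallbackSketch {
  private static final Logger LOG = Logger.getLogger(FooterFallbackSketch.class.getName());

  /** Hypothetical stand-in for InternalDataFile; only the fields the sketch needs. */
  static final class DataFile {
    final String path;
    final List<String> columnStats;
    final long recordCount;

    DataFile(String path, List<String> columnStats, long recordCount) {
      this.path = path;
      this.columnStats = columnStats;
      this.recordCount = recordCount;
    }
  }

  /**
   * Enriches files from the (assumed) metadata index. Files that the index does not
   * cover, or covers with empty stats, are re-processed by footerReader in one batch.
   */
  static List<DataFile> computeStats(
      List<DataFile> files,
      Map<String, DataFile> metadataIndex,
      Function<List<DataFile>, List<DataFile>> footerReader) {
    List<DataFile> withStats = new ArrayList<>();
    List<DataFile> missing = new ArrayList<>();
    for (DataFile file : files) {
      DataFile enriched = metadataIndex.get(file.path);
      if (enriched == null || enriched.columnStats.isEmpty()) {
        missing.add(file); // previously returned as-is with recordCount=0
      } else {
        withStats.add(enriched);
      }
    }
    if (!missing.isEmpty()) {
      // WARN-level visibility into metadata table gaps.
      LOG.warning(missing.size()
          + " file(s) have no column stats in the metadata table; falling back to parquet footers");
      withStats.addAll(footerReader.apply(missing));
    }
    return withStats;
  }
}
```

Batching the missing files into a single fallback call means the footer reader only runs when a gap exists, and only over the affected files.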

Test plan

  • Existing TestHudiFileStatsExtractor tests pass
  • New test: all metadata stats missing — falls back to parquet footers
  • New test: partial metadata stats missing — only missing files fall back to parquet footers

… for a file

In computeColumnStatsFromMetadataTable, files absent from the metadata
table's column stats index were silently returned with empty columnStats
and recordCount=0, causing downstream consumers to treat valid files as
empty. After fetching stats from the metadata table, files with empty
columnStats are now detected and re-processed via
computeColumnStatsFromParquetFooters. A WARN log is emitted when the
fallback is triggered.
@rahil-c rahil-c self-requested a review April 3, 2026 21:30
return withStats.stream();
}

private void classifyFileByMetadataStats(

I think it would be better to update the flow here to have a method that returns an Optional<InternalDataFile> and then you use that to add to the lists in the other method. I think this will make the code a bit easier to follow.

.collect(CustomCollectors.toList(fileStats.size()));
long recordCount = getMaxFromColumnStats(columnStats).orElse(0L);
InternalDataFile result =
file.toBuilder().columnStats(columnStats).recordCount(recordCount).build();

If there are no stats, we should avoid the toBuilder and new object creation.


assertEquals(2, output.size());
// fileWithoutStats must have fallen back to parquet footers and have full stats
InternalDataFile fallen =

Can we call this fromFooter to match fromMeta?

@vinishjail97

Addressed all feedback in cf186db:

  • Renamed classifyFileByMetadataStats to tryEnrichWithMetadataStats and changed it to return Optional<InternalDataFile> — present means stats found, empty means needs fallback. The caller uses ifPresentOrElse to route into the two lists. (comment 1)
  • Added an early return of Optional.empty() when fileStats is empty, avoiding the unnecessary toBuilder() and object creation. (comment 2)
  • Renamed fallen to fromFooter to match fromMeta. (comment 3)
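For readers following along, the shape of that refactor might look roughly like the sketch below. It is an illustration under assumptions, not the code in cf186db: `DataFile` is a hypothetical stand-in for `InternalDataFile`, per-column stats are modeled as a plain list of value counts, and the record count is taken as their maximum (a guess at what `getMaxFromColumnStats` does).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class OptionalRoutingSketch {
  /** Hypothetical stand-in for InternalDataFile. */
  static final class DataFile {
    final String path;
    final List<Long> valueCounts; // assumed: one value count per column
    final long recordCount;

    DataFile(String path, List<Long> valueCounts, long recordCount) {
      this.path = path;
      this.valueCounts = valueCounts;
      this.recordCount = recordCount;
    }
  }

  /** Present means stats were found; empty means the file needs the footer fallback. */
  static Optional<DataFile> tryEnrichWithMetadataStats(
      DataFile file, Map<String, List<Long>> statsByPath) {
    List<Long> fileStats = statsByPath.getOrDefault(file.path, List.of());
    if (fileStats.isEmpty()) {
      return Optional.empty(); // early return: no copy of the file is built
    }
    long recordCount = fileStats.stream().mapToLong(Long::longValue).max().orElse(0L);
    return Optional.of(new DataFile(file.path, fileStats, recordCount));
  }

  /** Routes each file into exactly one of the two lists. */
  static void route(
      List<DataFile> files,
      Map<String, List<Long>> statsByPath,
      List<DataFile> withStats,
      List<DataFile> needsFallback) {
    for (DataFile file : files) {
      tryEnrichWithMetadataStats(file, statsByPath)
          .ifPresentOrElse(withStats::add, () -> needsFallback.add(file));
    }
  }
}
```

Because the helper no longer mutates the lists itself, the caller is the only place that decides where a file goes. Note that `Optional.ifPresentOrElse` requires Java 9 or later.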
