Fallback to parquet footers when metadata table col stats are missing #818
Open
vinishjail97 wants to merge 3 commits into main
… for a file

In computeColumnStatsFromMetadataTable, files absent from the metadata table's column stats index were silently returned with empty columnStats and recordCount=0, causing downstream consumers to treat valid files as empty. After fetching stats from the metadata table, files with empty columnStats are now detected and re-processed via computeColumnStatsFromParquetFooters. A WARN log is emitted when the fallback is triggered.
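The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `SketchDataFile`, `FooterFallbackSketch`, `addStats`, and the `footerReader` parameter are hypothetical stand-ins, the metadata table is modeled as a plain `Map`, and `java.util.logging` is assumed in place of whatever logger the project actually uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.logging.Logger;

// Hypothetical, simplified stand-in for InternalDataFile (only the fields discussed here).
final class SketchDataFile {
  final String path;
  final List<String> columnStats; // stand-in for the real column stat objects
  final long recordCount;

  SketchDataFile(String path, List<String> columnStats, long recordCount) {
    this.path = path;
    this.columnStats = columnStats;
    this.recordCount = recordCount;
  }
}

final class FooterFallbackSketch {
  private static final Logger LOG = Logger.getLogger(FooterFallbackSketch.class.getName());

  /**
   * Uses stats from the (simulated) metadata table index where present. Files with no
   * entry, or an entry with empty stats, are collected and handed to footerReader.
   * Note: fallback results are appended after the indexed files, so input order is not kept.
   */
  static List<SketchDataFile> addStats(
      List<SketchDataFile> files,
      Map<String, SketchDataFile> metadataTableIndex,
      Function<List<SketchDataFile>, List<SketchDataFile>> footerReader) {
    List<SketchDataFile> withStats = new ArrayList<>();
    List<SketchDataFile> missingStats = new ArrayList<>();
    for (SketchDataFile file : files) {
      SketchDataFile indexed = metadataTableIndex.get(file.path);
      if (indexed == null || indexed.columnStats.isEmpty()) {
        missingStats.add(file); // do not emit it with recordCount=0
      } else {
        withStats.add(indexed);
      }
    }
    if (!missingStats.isEmpty()) {
      LOG.warning(missingStats.size()
          + " file(s) have no column stats in the metadata table; falling back to parquet footers");
      withStats.addAll(footerReader.apply(missingStats));
    }
    return withStats;
  }
}
```

Collecting the missing files first and reading their footers in one call keeps the warning to a single log line per batch rather than one per file.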
    return withStats.stream();
  }

  private void classifyFileByMetadataStats(
Contributor
I think it would be better to update the flow here to have a method that returns an Optional<InternalDataFile> and then you use that to add to the lists in the other method. I think this will make the code a bit easier to follow.
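One possible shape for this suggestion is sketched below. It is an assumption about what the reviewer has in mind, not the change that was committed: `StatsFile`, `OptionalStatsSketch`, `withMetadataStats`, and `classify` are hypothetical names, column stats are modeled as a map from column name to value count, and the record count is assumed to be the largest of those counts.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in for InternalDataFile; columnStats maps column name -> value count.
final class StatsFile {
  final String path;
  final Map<String, Long> columnStats;
  final long recordCount;

  StatsFile(String path, Map<String, Long> columnStats, long recordCount) {
    this.path = path;
    this.columnStats = columnStats;
    this.recordCount = recordCount;
  }
}

final class OptionalStatsSketch {
  /** Returns the file enriched with index stats, or empty when the index has none for it. */
  static Optional<StatsFile> withMetadataStats(StatsFile file, Map<String, Long> statsFromIndex) {
    if (statsFromIndex == null || statsFromIndex.isEmpty()) {
      return Optional.empty();
    }
    // Assumption: the record count is the largest per-column value count.
    long recordCount = Collections.max(statsFromIndex.values());
    return Optional.of(new StatsFile(file.path, statsFromIndex, recordCount));
  }

  /** The caller owns both lists; the helper above has no side effects. */
  static void classify(
      StatsFile file,
      Map<String, Long> statsFromIndex,
      List<StatsFile> withStats,
      List<StatsFile> needsFooterFallback) {
    Optional<StatsFile> enriched = withMetadataStats(file, statsFromIndex);
    if (enriched.isPresent()) {
      withStats.add(enriched.get());
    } else {
      needsFooterFallback.add(file);
    }
  }
}
```

Keeping the list mutation in the caller means the stats lookup can be tested on its own, which is presumably what makes the flow easier to follow.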
        .collect(CustomCollectors.toList(fileStats.size()));
    long recordCount = getMaxFromColumnStats(columnStats).orElse(0L);
    InternalDataFile result =
        file.toBuilder().columnStats(columnStats).recordCount(recordCount).build();
Contributor
If there are no stats, we should avoid the toBuilder and new object creation.
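A rough sketch of the early return this comment asks for, under stated assumptions: `CopyOnStatsFile` and `SkipCopySketch` are hypothetical, `withStats` stands in for the `toBuilder()...build()` chain, and stats are reduced to a list of per-column value counts.

```java
import java.util.List;

// Hypothetical stand-in for InternalDataFile.
final class CopyOnStatsFile {
  final String path;
  final List<Long> valueCounts;
  final long recordCount;

  CopyOnStatsFile(String path, List<Long> valueCounts, long recordCount) {
    this.path = path;
    this.valueCounts = valueCounts;
    this.recordCount = recordCount;
  }

  // Stand-in for toBuilder()...build(): always allocates a new instance.
  CopyOnStatsFile withStats(List<Long> newValueCounts, long newRecordCount) {
    return new CopyOnStatsFile(path, newValueCounts, newRecordCount);
  }
}

final class SkipCopySketch {
  static CopyOnStatsFile applyStats(CopyOnStatsFile file, List<Long> valueCounts) {
    if (valueCounts.isEmpty()) {
      return file; // nothing to add: reuse the existing instance, no copy
    }
    long recordCount = 0L;
    for (long count : valueCounts) {
      recordCount = Math.max(recordCount, count);
    }
    return file.withStats(valueCounts, recordCount);
  }
}
```

With this guard, the caller can also detect the no-stats case by checking whether the returned stats are still empty before routing the file to the footer fallback.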

    assertEquals(2, output.size());
    // fileWithoutStats must have fallen back to parquet footers and have full stats
    InternalDataFile fallen =
Contributor
Can we call this fromFooter to match fromMeta?
…ats, rename fallen to fromFooter
Contributor
Author
Addressed all feedback in cf186db:
Summary
- In `computeColumnStatsFromMetadataTable`, files absent from the metadata table's column stats index were silently returned with empty `columnStats` and `recordCount=0`, causing downstream consumers to treat valid files as empty
- Files with empty `columnStats` are now detected and re-processed via `computeColumnStatsFromParquetFooters`
- A `WARN` log is emitted when the fallback is triggered, providing visibility into metadata table gaps

Test plan
- `TestHudiFileStatsExtractor` tests pass