fix(ingest/s3): handle file names with dots in the stem when resolving format extension #17595
Open
rospe wants to merge 1 commit into
…g format extension

`pathlib.Path(...).suffix` is purely lexical: it returns everything after the last dot. For files like `events.account.update-2026-05-27-<hash>.gz` this means stripping the `.gz` compression suffix leaves `.update-2026-05-27-<hash>` as the apparent format extension, which is not a real file format. The inferrer then receives a meaningless extension and emits an "unsupported extension" warning even when `default_extension` is set on the path_spec.

Only keep an extension after compression stripping (and for uncompressed files) if it matches one of the supported file types (csv, tsv, json, parquet, avro). Otherwise fall through to `default_extension` so the inferrer can still be selected correctly.

Behaviour preserved:
- `data.json` -> `.json`
- `data.json.gz` -> `.json` (compression stripped, inner extension kept)
- `data.parquet.gz` -> `.parquet`
- `data.gz` -> `default_extension` if set, else `""`

New behaviour:
- `foo.bar.baz-<hash>.gz` -> `default_extension` (was `.baz-<hash>`)
- `foo.bar.baz-<hash>` -> `default_extension` (was `.baz-<hash>`)

Adds:
- `_resolve_format_extension` helper extracted from `S3Source.get_fields` for direct unit testing
- parametrized unit tests covering both preserved and new behaviour
- updated `default_extension` field description on `PathSpec`
- "File type detection" docs section in the S3 connector docs
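The lexical behaviour described above can be reproduced with the standard library alone. This is a minimal sketch, not code from the PR: the key below is a hypothetical stand-in for the real file name, with `abc123` in place of `<hash>`.

```python
from pathlib import PurePosixPath

# Hypothetical object key modelled on the example in the commit message;
# "abc123" is a placeholder for the elided <hash>.
key = "events.account.update-2026-05-27-abc123.gz"

path = PurePosixPath(key)

# The outer suffix is the compression suffix, as expected.
outer_suffix = path.suffix  # ".gz"

# After dropping ".gz", the "suffix" of what remains is simply the text from
# the last dot onward, which here is part of the event name, not a format.
inner_suffix = PurePosixPath(path.stem).suffix  # ".update-2026-05-27-abc123"

# For a conventionally named file the same two steps give the real format.
conventional_inner = PurePosixPath(PurePosixPath("data.json.gz").stem).suffix  # ".json"

print(outer_suffix, inner_suffix, conventional_inner)
```

Because `inner_suffix` is non-empty, any fallback that only triggers on an empty extension is skipped, which is the situation the commit message describes.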
Linear: ING-2768

Thanks for your contribution! We have created an internal ticket to track this PR. A member of the core DataHub team will be assigned to review it within the next few business days - you will get a follow-up comment once a reviewer is assigned.
Summary
`pathlib.Path(...).suffix` is purely lexical: it returns everything after the last dot. For S3 files like `events.account.update-2026-05-27-<hash>.gz` this means stripping the `.gz` compression suffix leaves `.update-2026-05-27-<hash>` as the apparent format extension, which is not a real file format. The inferrer then receives a meaningless extension and emits an "unsupported extension" warning even when `default_extension` is set on the path_spec.

This PR makes `S3Source.get_fields` only keep an extension after compression stripping (and for uncompressed files) if it matches one of the supported file types (`csv`, `tsv`, `json`, `parquet`, `avro`). Otherwise it falls through to `default_extension` so the inferrer can still be selected correctly.

Why
We hit this on a real bucket where event names, and therefore file names, contain dots, e.g.

`s3://.../event=events.account.update/legalEntity=.../events.account.update-2026-05-27-11-<hash>.gz`

Even with `default_extension: json` set on the path_spec, every one of these files emitted an "unsupported extension" warning because `pathlib.Path(...).suffix` returned `.update-2026-05-27-11-<hash>` after the `.gz` was stripped. That value is non-empty, so the `default_extension` fallback was never reached.

Behaviour
Preserved

| File name | Resolved extension |
| --- | --- |
| `data.json` | `.json` |
| `data.json.gz` | `.json` (compression stripped, inner extension kept) |
| `data.parquet` | `.parquet` |
| `data.parquet.gz` | `.parquet` |
| `data.gz` (default=json) | `.json` |
| `data.gz` (no default) | `""` → warn (unchanged) |
| `data` (no extension, default=json) | `.json` |
| `data` (no extension, no default) | `""` → warn (unchanged) |

New

| File name | Before | After |
| --- | --- | --- |
| `foo.bar.baz-<hash>.gz` (default=json) | `.baz-<hash>` → warn | `.json` ✓ |
| `foo.bar.baz-<hash>` (default=json) | `.baz-<hash>` → warn | `.json` ✓ |
| `data.txt` (default=json) | `.txt` → warn | `.json` ✓ |
| `data.JSON` (uppercase, default=json) | `.JSON` → warn | `.json` ✓ |

The behaviour change is one-directional and gated by `default_extension`: when a user sets it, it now applies to any file whose name does not have a recognised format extension, not just files with no dot at all. Without `default_extension`, behaviour is unchanged. There is no panic or data-loss risk: the worst case is a "could not infer schema" warning instead of the previous "unsupported extension" warning when `default_extension` is set and a stray file's actual content does not match the chosen format.

Changes
- `_resolve_format_extension` helper in `s3/source.py` extracted from `S3Source.get_fields` so the resolution logic can be unit-tested directly. `S3Source.get_fields` now delegates to the helper.
- `PathSpec.default_extension` field description updated to reflect that it applies to any file whose format cannot be inferred from the file name.
- "File type detection" docs section in `metadata-ingestion/docs/sources/s3/s3_post.md` documenting how extensions are resolved and when `default_extension` kicks in.
- Parametrized unit tests in `tests/unit/s3/test_s3_source.py` covering preserved behaviour (plain `.json`, `.json.gz`, `.parquet.gz`, `.gz`-only with/without default, compression disabled) and new behaviour (dotted-stem compressed/uncompressed, with/without default).

Checklist

- … `PathSpec.default_extension` description)

Notes for reviewers

- The `test_get_folder_info_returns_expected_folder` failure on master (1-hour timezone offset between `tzutc()` and `datetime.timezone.utc`) is unrelated to this change.
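For readers who want the resolution rules from the Behaviour section in executable form, the following is a rough sketch of how such a resolver could look. It is not the PR's `_resolve_format_extension`: the function name, signature, and the `COMPRESSION_SUFFIXES` set are assumptions made for illustration, and only the list of supported formats is taken from the description above. The `compression disabled` case mentioned in the tests is not modelled.

```python
from pathlib import PurePosixPath
from typing import Optional

# Supported formats as listed in the PR description.
SUPPORTED_FORMATS = {"csv", "tsv", "json", "parquet", "avro"}
# Assumption: only ".gz" is modelled here; the real connector may handle more.
COMPRESSION_SUFFIXES = {".gz"}


def resolve_format_extension_sketch(
    file_name: str, default_extension: Optional[str] = None
) -> str:
    """Hypothetical resolver: returns a dotted extension such as ".json", or ""."""
    path = PurePosixPath(file_name)
    suffix = path.suffix
    if suffix in COMPRESSION_SUFFIXES:
        # Drop the compression suffix and inspect the inner extension instead.
        suffix = PurePosixPath(path.stem).suffix
    # Keep the lexical suffix only when it names a supported format. The match
    # is case-sensitive, mirroring the data.JSON row in the "New" table.
    if suffix[1:] in SUPPORTED_FORMATS:
        return suffix
    # Otherwise fall through to the configured default, if there is one.
    return f".{default_extension}" if default_extension else ""


if __name__ == "__main__":
    for name, default in [
        ("data.json.gz", None),
        ("data.gz", "json"),
        ("foo.bar.baz-abc123.gz", "json"),  # "abc123" stands in for <hash>
        ("foo.bar.baz-abc123", None),
    ]:
        print(name, default, repr(resolve_format_extension_sketch(name, default)))
```

The sketch checks membership in the supported set before consulting the default, so a recognised extension always wins over `default_extension`, and an unrecognised one never blocks the fallback.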