fix(ingest/s3): handle file names with dots in the stem when resolving format extension #17595

Open
rospe wants to merge 1 commit into datahub-project:master from rospe:feat/s3-compression-extension-stem-dots

@rospe rospe commented May 27, 2026

Summary

pathlib.Path(...).suffix is purely lexical: it returns the last dot in the file name and everything after it. For S3 files like events.account.update-2026-05-27-<hash>.gz, stripping the .gz compression suffix therefore leaves .update-2026-05-27-<hash> as the apparent format extension, which is not a real file format. The inferrer then receives a meaningless extension and emits an "unsupported extension" warning even when default_extension is set on the path_spec.
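
To make the lexical behaviour concrete, here is a small standard-library demonstration. The file name is modelled on the example above, with `abc123` as a made-up stand-in for `<hash>`.

```python
from pathlib import PurePosixPath

# Hypothetical key modelled on the PR's example; "abc123" stands in for <hash>.
name = "events.account.update-2026-05-27-abc123.gz"

path = PurePosixPath(name)
print(path.suffix)  # .gz

# Dropping the compression suffix leaves the rest of the name as the "stem".
stripped = PurePosixPath(path.stem)
print(stripped.suffix)  # .update-2026-05-27-abc123

# For a conventionally named file the inner suffix is a real format.
print(PurePosixPath("data.json.gz").suffixes)  # ['.json', '.gz']
```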

This PR makes S3Source.get_fields only keep an extension after compression stripping (and for uncompressed files) if it matches one of the supported file types (csv, tsv, json, parquet, avro). Otherwise it falls through to default_extension so the inferrer can still be selected correctly.
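
A minimal sketch of the resolution rule described above, for illustration only. It is not the PR's actual helper: the function name, its parameters, and the list of compression suffixes are assumptions. Only the list of supported formats is taken from the text.

```python
from pathlib import PurePosixPath
from typing import Optional

# Supported formats as listed in the PR text.
SUPPORTED_FORMATS = {".csv", ".tsv", ".json", ".parquet", ".avro"}
# Assumed for illustration; the connector's real list may differ.
COMPRESSION_SUFFIXES = {".gz", ".bz2"}


def resolve_format_extension_sketch(
    file_name: str,
    default_extension: Optional[str] = None,
    enable_compression: bool = True,
) -> str:
    """Return a supported format extension, the default, or ""."""
    path = PurePosixPath(file_name)
    extension = path.suffix
    if enable_compression and extension in COMPRESSION_SUFFIXES:
        # Strip the compression suffix and inspect the inner extension.
        extension = PurePosixPath(path.stem).suffix
    if extension in SUPPORTED_FORMATS:
        return extension
    # Not a recognised format: fall through to the default, if one is set.
    if default_extension:
        return "." + default_extension.lstrip(".")
    return ""


print(resolve_format_extension_sketch("data.json.gz"))                   # .json
print(resolve_format_extension_sketch("foo.bar.baz-abc123.gz", "json"))  # .json
print(resolve_format_extension_sketch("data.gz"))                        # prints an empty line
```

The match is deliberately case-sensitive in this sketch, so a name like `data.JSON` is treated as unrecognised and resolves to the default.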

Why

We hit this on a real bucket where event names contain dots in their file names, e.g.

s3://.../event=events.account.update/legalEntity=.../events.account.update-2026-05-27-11-<hash>.gz

Even with default_extension: json set on the path_spec, every one of these files emitted an "unsupported extension" warning. After the .gz was stripped, Path.suffix returned .update-2026-05-27-11-<hash>, which is non-empty, so the default_extension fallback was never reached.
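
The failure mode can be reproduced with a short sketch. The function below only assumes the shape described here (prefer any non-empty suffix, otherwise use the default). It is not the code that was on master, and `abc123` is a made-up stand-in for `<hash>`.

```python
from pathlib import PurePosixPath


def old_style_extension(file_name: str, default_extension: str = "") -> str:
    # Assumed shape of the earlier logic: strip ".gz", then prefer any
    # non-empty suffix over the configured default.
    path = PurePosixPath(file_name)
    if path.suffix == ".gz":
        path = PurePosixPath(path.stem)
    fallback = f".{default_extension}" if default_extension else ""
    return path.suffix or fallback


# No dot left after stripping: the default is used.
print(old_style_extension("data.gz", "json"))  # .json

# A dot in the stem yields a non-empty suffix, so the default is skipped.
print(old_style_extension("events.account.update-2026-05-27-11-abc123.gz", "json"))
# .update-2026-05-27-11-abc123
```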

Behaviour

Preserved

| Input | Result |
| --- | --- |
| data.json | .json |
| data.json.gz | .json (compression stripped, inner extension kept) |
| data.parquet | .parquet |
| data.parquet.gz | .parquet |
| data.gz (default=json) | .json |
| data.gz (no default) | "" → warn (unchanged) |
| data (no extension, default=json) | .json |
| data (no extension, no default) | "" → warn (unchanged) |

New

| Input | Old | New |
| --- | --- | --- |
| foo.bar.baz-<hash>.gz (default=json) | .baz-<hash> → warn | .json |
| foo.bar.baz-<hash> (default=json) | .baz-<hash> → warn | .json |
| data.txt (default=json) | .txt → warn | .json |
| data.JSON (uppercase, default=json) | .JSON → warn | .json |

The behaviour change is one-directional and gated by default_extension: when a user sets it, it now applies to any file whose name does not have a recognised format extension, not just files with no dot at all. Without default_extension, behaviour is unchanged. There is no panic or data-loss risk — the worst case is a "could not infer schema" warning instead of the previous "unsupported extension" warning when default_extension is set and a stray file's actual content does not match the chosen format.

Changes

  • New _resolve_format_extension helper in s3/source.py extracted from S3Source.get_fields so the resolution logic can be unit-tested directly.
  • S3Source.get_fields now delegates to the helper.
  • PathSpec.default_extension field description updated to reflect that it applies to any file whose format cannot be inferred from the file name.
  • New "File type detection" section in metadata-ingestion/docs/sources/s3/s3_post.md documenting how extensions are resolved and when default_extension kicks in.
  • Parametrized unit tests in tests/unit/s3/test_s3_source.py covering preserved behaviour (plain .json, .json.gz, .parquet.gz, .gz-only with/without default, compression disabled) and new behaviour (dotted-stem compressed/uncompressed, with/without default).
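
As a rough illustration of the table-driven style of the tests listed above, here is a self-contained sketch. It uses `unittest.subTest` from the standard library rather than pytest parametrization, and `resolve` is a compact hypothetical stand-in for the helper (handling `.gz` only), not the code under test in the PR.

```python
import unittest
from pathlib import PurePosixPath

SUPPORTED = {".csv", ".tsv", ".json", ".parquet", ".avro"}


def resolve(name, default=None):
    # Compact hypothetical stand-in for the helper; only ".gz" is stripped.
    path = PurePosixPath(name)
    ext = PurePosixPath(path.stem).suffix if path.suffix == ".gz" else path.suffix
    if ext in SUPPORTED:
        return ext
    return f".{default}" if default else ""


# (file name, default_extension, expected result), taken from the tables above
# with "abc123" standing in for <hash>.
CASES = [
    ("data.json", None, ".json"),
    ("data.json.gz", None, ".json"),
    ("data.parquet.gz", None, ".parquet"),
    ("data.gz", "json", ".json"),
    ("data.gz", None, ""),
    ("foo.bar.baz-abc123.gz", "json", ".json"),
    ("foo.bar.baz-abc123", "json", ".json"),
]


class ResolveExtensionTest(unittest.TestCase):
    def test_cases(self):
        for name, default, expected in CASES:
            # subTest reports each failing row separately.
            with self.subTest(name=name, default=default):
                self.assertEqual(resolve(name, default), expected)


def run_suite():
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ResolveExtensionTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    run_suite()
```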

Checklist

  • PR title follows the conventional commit format
  • Tests added (10 parametrized cases on the new helper)
  • Docs updated (S3 connector docs + PathSpec.default_extension description)
  • No breaking change — see behaviour table above

Notes for reviewers

  • The pre-existing test_get_folder_info_returns_expected_folder failure on master (1-hour timezone offset between tzutc() and datetime.timezone.utc) is unrelated to this change.

…g format extension

`pathlib.Path(...).suffix` is purely lexical: it returns the last dot and
everything after it. For files like `events.account.update-2026-05-27-<hash>.gz`
this means stripping the `.gz` compression suffix leaves
`.update-2026-05-27-<hash>` as the apparent format extension, which is
not a real file format. The inferrer then receives a meaningless
extension and emits an "unsupported extension" warning even when
`default_extension` is set on the path_spec.

Only keep an extension after compression stripping (and for
uncompressed files) if it matches one of the supported file types
(csv, tsv, json, parquet, avro). Otherwise fall through to
`default_extension` so the inferrer can still be selected correctly.

Behaviour preserved:

- `data.json`        -> `.json`
- `data.json.gz`     -> `.json` (compression stripped, inner extension kept)
- `data.parquet.gz`  -> `.parquet`
- `data.gz`          -> `default_extension` if set, else `""`

New behaviour:

- `foo.bar.baz-<hash>.gz` -> `default_extension` (was `.baz-<hash>`)
- `foo.bar.baz-<hash>`    -> `default_extension` (was `.baz-<hash>`)

Adds:
- `_resolve_format_extension` helper extracted from `S3Source.get_fields`
  for direct unit testing
- parametrized unit tests covering both preserved and new behaviour
- updated `default_extension` field description on `PathSpec`
- "File type detection" docs section in the S3 connector docs
@github-actions github-actions bot added the ingestion and community-contribution labels May 27, 2026
@github-actions

Linear: ING-2768

Thanks for your contribution! We have created an internal ticket to track this PR. A member of the core DataHub team will be assigned to review it within the next few business days - you will get a follow-up comment once a reviewer is assigned.

@maggiehays maggiehays added the needs-review label May 27, 2026