@ethan-tyler
Contributor

Which issue does this PR close?

Rationale for this change

This is the end-to-end plumbing PR to get input_file_name() working. I started with an SLT test to define the expected behavior, then built out the plumbing to make it pass. It is scoped to the SELECT list only (the guaranteed-pushdown case) per discussion with @alamb and @adriangb, with broader pushdown support to follow once #19538 lands.

What changes are included in this PR?

Add an input_file_name() function that returns the file path for each row by injecting the value at the file opener boundary. The injection is opt-in (it happens only when the function is referenced), SELECT * output stays stable, and unsupported contexts produce an error.

Analyzer rewrite

  • Rewrites input_file_name() to the reserved column __datafusion_input_file_name
  • Annotates TableScan.projected_schema only when needed
  • Errors on reserved-name collisions
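
As a rough illustration of the rewrite step, here is a minimal sketch that uses a toy expression tree rather than DataFusion's real `Expr` type. The `Expr` enum and `rewrite_input_file_name` are hypothetical names made up for this example; only the function name and the reserved column name come from the description above.

```rust
// Toy expression tree for illustration only; this is NOT DataFusion's `Expr`.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    ScalarFunction { name: String, args: Vec<Expr> },
    Alias(Box<Expr>, String),
}

// Reserved column name as given in the PR description.
const INPUT_FILE_NAME_COL: &str = "__datafusion_input_file_name";

/// Recursively replace zero-argument `input_file_name()` calls with a
/// reference to the reserved column. Returns the rewritten expression and a
/// flag saying whether anything was replaced, so a caller could annotate the
/// scan schema only when needed.
fn rewrite_input_file_name(expr: Expr) -> (Expr, bool) {
    match expr {
        Expr::ScalarFunction { name, args }
            if name.as_str() == "input_file_name" && args.is_empty() =>
        {
            (Expr::Column(INPUT_FILE_NAME_COL.to_string()), true)
        }
        Expr::ScalarFunction { name, args } => {
            let mut changed = false;
            let args: Vec<Expr> = args
                .into_iter()
                .map(|arg| {
                    let (rewritten, c) = rewrite_input_file_name(arg);
                    changed |= c;
                    rewritten
                })
                .collect();
            (Expr::ScalarFunction { name, args }, changed)
        }
        Expr::Alias(inner, alias) => {
            let (rewritten, changed) = rewrite_input_file_name(*inner);
            (Expr::Alias(Box::new(rewritten), alias), changed)
        }
        // Plain columns are left untouched.
        other => (other, false),
    }
}
```

The returned flag is one way to keep the rewrite opt-in: queries that never mention input_file_name() come back unchanged and need no extra field on the scan.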

Physical planning + execution

  • Planner enables scan-time injection when the internal field is projected
  • FileScanConfig::open wraps the opener to append a Utf8 column with the file location to each batch
  • Stats, equivalence properties, and schema are updated for the appended field
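
The per-batch injection can be pictured with the sketch below. It assumes a toy `Batch` struct in place of Arrow's `RecordBatch`, and `append_file_name` and `wrap_file_stream` are hypothetical helpers, not the functions added in this PR.

```rust
/// Toy column-major batch for illustration; NOT arrow's `RecordBatch`.
#[derive(Debug, Clone, PartialEq)]
struct Batch {
    column_names: Vec<String>,
    columns: Vec<Vec<String>>,
    num_rows: usize,
}

/// Append one string column that repeats the file location for every row.
fn append_file_name(mut batch: Batch, file_location: &str) -> Batch {
    batch
        .column_names
        .push("__datafusion_input_file_name".to_string());
    batch
        .columns
        .push(vec![file_location.to_string(); batch.num_rows]);
    batch
}

/// Wrap the batches produced for a single file so that each one carries the
/// extra column. This mirrors the idea of wrapping the opener's output.
fn wrap_file_stream<I>(batches: I, file_location: String) -> impl Iterator<Item = Batch>
where
    I: Iterator<Item = Batch>,
{
    batches.map(move |batch| append_file_name(batch, &file_location))
}
```

Because the column is appended at the end, the indices of existing columns stay the same, which is presumably why the schema, statistics, and equivalence properties only need to account for one extra trailing field.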

Optimizer

  • OptimizeProjections handles the internal column safely (prevents an out-of-bounds index)
  • Regression test: a reserved-name column that comes from the source schema is not treated as injected

Scope (V1)

  • Works in SELECT list only
  • Plan-time errors for non-file sources (VALUES/MemTable), joins (ambiguous file origin), and usage outside the SELECT list (WHERE/GROUP BY/ORDER BY/HAVING)
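
The V1 scope rules can be summarized as a small decision function. This is only a sketch: the `Source` and `Clause` enums, the function name, and the error strings are all hypothetical and do not reflect the actual planner code or its messages.

```rust
// Hypothetical classification of where the data comes from.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Source {
    FileScan,
    Values,
    MemTable,
    Join,
}

// Hypothetical classification of where the function call appears.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Clause {
    SelectList,
    Where,
    GroupBy,
    OrderBy,
    Having,
}

/// Decide at plan time whether `input_file_name()` is allowed.
/// The error texts are placeholders, not the messages emitted by the PR.
fn check_input_file_name_usage(source: Source, clause: Clause) -> Result<(), String> {
    if clause != Clause::SelectList {
        return Err(format!(
            "input_file_name() is only supported in the SELECT list, found in {:?}",
            clause
        ));
    }
    match source {
        Source::FileScan => Ok(()),
        Source::Join => Err("input_file_name() is ambiguous over a join".to_string()),
        Source::Values | Source::MemTable => {
            Err("input_file_name() requires a file-backed source".to_string())
        }
    }
}
```

Failing at plan time rather than at execution means an unsupported query is rejected before any file is opened.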

Are these changes tested?

Yes.

cargo test -p datafusion-sqllogictest --test sqllogictests -- input_file_name.slt
cargo test -p datafusion-datasource extended_file_columns_inject_input_file_name -q
cargo test -p datafusion-optimizer optimize_projections_keeps_reserved_column_from_source -q

The SLT uses CSV for deterministic multi-file assertions. Parquet is supported via the same FileScanConfig path, and Parquet-specific SLTs can follow.

Are there any user-facing changes?

Yes. New 0-arg volatile scalar function: input_file_name() -> Utf8

CREATE EXTERNAL TABLE t STORED AS PARQUET LOCATION '...';
SELECT col1, input_file_name() FROM t;

SELECT * output is unchanged unless input_file_name() is explicitly referenced.

@github-actions (bot) added the labels physical-expr, optimizer, core, sqllogictest, common, functions, and datasource on Jan 30, 2026
@adriangb
Contributor

I think it should be much simpler than this: create a UDF and pass it through ProjectionExprs::transform_exprs(|expr| expr.transform(|expr| // if expr is our ScalarUDF, replace with literal filename))
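
To make the suggestion concrete, here is a minimal sketch of the "replace the UDF with a literal per file" idea. It uses a toy `PhysExpr` enum and a hand-written bottom-up `transform`; neither is DataFusion's actual physical expression API, and `bind_file_name` is a hypothetical helper standing in for what a call through ProjectionExprs::transform_exprs might do.

```rust
// Toy physical expression for illustration; NOT DataFusion's `PhysicalExpr`.
#[derive(Debug, Clone, PartialEq)]
enum PhysExpr {
    Column(usize),
    Literal(String),
    ScalarUdf { name: String, args: Vec<PhysExpr> },
}

/// Bottom-up transform: rewrite the children first, then apply `f` to the node.
fn transform(expr: PhysExpr, f: &dyn Fn(PhysExpr) -> PhysExpr) -> PhysExpr {
    let rebuilt = match expr {
        PhysExpr::ScalarUdf { name, args } => PhysExpr::ScalarUdf {
            name,
            args: args.into_iter().map(|arg| transform(arg, f)).collect(),
        },
        leaf => leaf,
    };
    f(rebuilt)
}

/// For one file, replace every `input_file_name()` call in the projection
/// with a string literal holding that file's location.
fn bind_file_name(projection: Vec<PhysExpr>, file_location: &str) -> Vec<PhysExpr> {
    let replace = |expr: PhysExpr| match expr {
        PhysExpr::ScalarUdf { name, args }
            if name.as_str() == "input_file_name" && args.is_empty() =>
        {
            PhysExpr::Literal(file_location.to_string())
        }
        other => other,
    };
    projection
        .into_iter()
        .map(|expr| transform(expr, &replace))
        .collect()
}
```

With this shape, nothing is appended to the batch. The projection is specialized once per file, so the file name becomes an ordinary constant by the time expressions are evaluated.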

Development

Successfully merging this pull request may close these issues.

Add input_file_name built-in function
