feat: add ExpressionPlacement enum for optimizer expression placement decisions #20065

adriangb · 2026-01-29T17:35:01Z

Summary

This PR is part of work towards #19387

Extracts the ExpressionPlacement enum from #20036 to provide a mechanism for expressions to indicate where they should be placed in the query plan for optimal execution.

I've opted to go the route of having expressions declare their behavior via a new API on enum Expr and trait PhysicalExpr:

enum Expr {
    pub fn placement(&self) -> ExpressionPlacement { ... }
   ...
}

And:

trait PhysicalExpr {
   fn placement(&self) -> ExpressionPlacement { ... }
}

Where ExpressionPlacement:

enum ExpressionPlacement {
    /// Argument is a literal constant value or an expression that can be
    /// evaluated to a constant at planning time.
    Literal,
    /// Argument is a simple column reference.
    Column,
    /// Argument is a complex expression that can be safely placed at leaf nodes.
    /// For example, if `get_field(struct_col, 'field_name')` is implemented as a
    /// leaf-pushable expression, then it would return this variant.
    /// Then `other_leaf_function(get_field(...), 42)` could also be classified as
    /// leaf-pushable using the knowledge that `get_field(...)` is leaf-pushable.
    PlaceAtLeaves,
    /// Argument is a complex expression that should be placed at root nodes.
    /// For example, `min(col1 + col2)` is not leaf-pushable because it requires per-row computation.
    PlaceAtRoot,
}

We arrived at ExprPlacement after iterating through a version that had:

enum ArgTriviality {
    Literal,
    Column,
    Trivial,
    NonTrivial,
}

This terminology came from existing concepts in the codebase that were sprinkled around various places in the logical and physical layers. Some examples:

datafusion/datafusion/physical-plan/src/projection.rs

Lines 282 to 290 in f819061

    
           let all_simple_exprs = 
        
               self.projector 
        
                   .projection() 
        
                   .as_ref() 
        
                   .iter() 
        
                   .all(|proj_expr| { 
        
                       proj_expr.expr.as_any().is::<Column>() 
        
                           || proj_expr.expr.as_any().is::<Literal>() 
        
                   });

datafusion/datafusion/physical-plan/src/projection.rs

Lines 1120 to 1125 in f819061

    
           /// Checks if the given expression is trivial. 
        
           /// An expression is considered trivial if it is either a `Column` or a `Literal`. 
        
           fn is_expr_trivial(expr: &Arc<dyn PhysicalExpr>) -> bool { 
        
               expr.as_any().downcast_ref::<Column>().is_some() 
        
                   || expr.as_any().downcast_ref::<Literal>().is_some() 
        
           }

datafusion/datafusion/optimizer/src/optimize_projections/mod.rs

Lines 589 to 592 in f819061

    
           // Check whether `expr` is trivial; i.e. it doesn't imply any computation. 
        
           fn is_expr_trivial(expr: &Expr) -> bool { 
        
               matches!(expr, Expr::Column(_) | Expr::Literal(_, _)) 
        
           }

The new API adds the nuance / distinction of the case of get_field(col, 'a') where it is neither a column nor a literal but it is trivial.

It also gives scalar functions the ability to classify themselves.
This part was a bit tricky because ScalarUDFImpl (the scalar function trait that users implement) lives in datafuions-expr which cannot have references to datafusion-physical-expr-common (where PhysicalExpr is defined).
But once we are in the physical layer scalar functions are represented as func: ScalarUDFImpl, args: Vec<Arc<dyn PhysicalExpr>>.
And since we can't have a trait method referencing PhysicalExpr there would be no way to ask a function to classify itself in the physical layer.

Additionally even if we could refer to PhysicalExpr from the ScalarUDFImpl trait we would then need 2 methods with similar but divergent logic (match on the Expr enum in one, downcast to various known types in the physical version) that adds boilerplate for implementers.

The ExprPlacement enum solves this problem: we can have a single method ScalarUDFImpl::placement(args: &[ExpressionPlacement]).
The parent of ScalarUDFImpl will call either Expr::placement or PhysicalExpr::placement depending on which one it has.

Changes

Add ExpressionPlacement enum in datafusion-expr-common with four variants:
- Literal - constant values
- Column - simple column references
- PlaceAtLeaves - cheap expressions (like get_field) that can be pushed to leaf nodes
- PlaceAtRoot - expensive expressions that should stay at root
Add placement() method to:
- Expr enum
- ScalarUDF / ScalarUDFImpl traits (with default returning PlaceAtRoot)
- PhysicalExpr trait (with default returning PlaceAtRoot)
- Physical expression implementations for Column, Literal, and ScalarFunctionExpr
Implement placement() for GetFieldFunc that returns PlaceAtLeaves when accessing struct fields with literal keys
Replace is_expr_trivial() function checks with placement() checks in:
- datafusion/optimizer/src/optimize_projections/mod.rs
- datafusion/physical-plan/src/projection.rs

Test Plan

cargo check passes on all affected packages
cargo test -p datafusion-optimizer passes
cargo test -p datafusion-physical-plan passes (except unrelated zstd feature test)
cargo test -p datafusion-functions --lib getfield passes

🤖 Generated with Claude Code

… decisions This extracts the ExpressionPlacement enum from PR apache#20036 to provide a mechanism for expressions to indicate where they should be placed in the query plan for optimal execution. Changes: - Add ExpressionPlacement enum with variants: Literal, Column, PlaceAtLeaves, PlaceAtRoot - Add placement() method to Expr, ScalarUDF, ScalarUDFImpl traits - Add placement() method to PhysicalExpr trait and implementations - Implement placement() for GetFieldFunc to return PlaceAtLeaves when accessing struct fields with literal keys - Replace is_expr_trivial() checks with placement() in optimizer and physical-plan projection code Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Add tests for GetFieldFunc::placement() covering: - Literal key access (leaf-pushable) - Column key access (not leaf-pushable) - PlaceAtRoot base expressions - Edge cases (empty args, literal base) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

adriangb · 2026-01-29T18:50:23Z

datafusion/sqllogictest/test_files/projection_pushdown.slt

-02)--ProjectionExec: expr=[get_field(s@0, value) as __common_expr_1]
-03)----FilterExec: id@0 > 2, projection=[s@1]
-04)------DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/projection_pushdown/simple.parquet]]}, projection=[id, s], file_type=parquet, predicate=id@0 > 2, pruning_predicate=id_null_count@1 != row_count@2 AND id_max@0 > 2, required_guarantees=[]
+01)ProjectionExec: expr=[get_field(s@0, value) + get_field(s@0, value) as doubled]


This is actually correct / an improvement. We are saying that get_field is very cheap, so no need to deduplicate it. I added an extra test above that shows that a more complex expression (id + s['value']) will get dudplicated (as a whole) by the CSE optimizer.

adriangb · 2026-01-29T18:52:38Z

datafusion/sqllogictest/test_files/unnest.slt

-03)----ProjectionExec: expr=[get_field(__unnest_placeholder(recursive_unnest_table.column3,depth=1)@0, c1) as __unnest_placeholder(UNNEST(recursive_unnest_table.column3)[c1]), column3@1 as column3]
-04)------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
+03)----RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
+04)------ProjectionExec: expr=[get_field(__unnest_placeholder(recursive_unnest_table.column3,depth=1)@0, c1) as __unnest_placeholder(UNNEST(recursive_unnest_table.column3)[c1]), column3@1 as column3]


Because the projection is a get_field it can be pushed under the RepartitionExec. Because this is a MemorySourceConfig data source (which doesn't accept projections, it's pointless to do so) it doesn't get pushed into the scan. But this is still correct / a win: we reduce the size of the data very cheaply by pulling out the field we care above and discarding the rest before we slice up the data in the RepartitionExec.

adriangb · 2026-01-29T18:53:24Z

datafusion/optimizer/src/optimize_projections/mod.rs

-// Check whether `expr` is trivial; i.e. it doesn't imply any computation.
-fn is_expr_trivial(expr: &Expr) -> bool {
-    matches!(expr, Expr::Column(_) | Expr::Literal(_, _))
-}


The point of this PR is to formalize up these hidden assumptions about expressions and let expressions like get_field participate in the decision of how to treat the expression.

adriangb · 2026-01-29T20:13:39Z

datafusion/sqllogictest/test_files/aggregate.slt

+01)ProjectionExec: expr=[count(Int64(1))@0 as count(), count(Int64(1))@0 as count(*)]
+02)--ProjectionExec: expr=[2 as count(Int64(1))]
+03)----PlaceholderRowExec


This is also an improvement.

Previously we would brodcast the literal 2 to the number of rows twice (2 as count(), 2 as count(*) -> lit(2), lit(2). Now we first expand it once in an inner projection then reference that column with aliases twice (i.e. clone the pointer to the array).

In this case it's 1 row so it's not really meaningful, but is nonetheless better. I'll see if I can craft an example that shows this behavior with N rows.

adriangb · 2026-01-29T21:22:13Z

@AdamGS I think this should be in a reviewable state 😄

adriangb self-assigned this Jan 29, 2026

github-actions bot added logical-expr Logical plan and expressions physical-expr Changes to the physical-expr crates optimizer Optimizer rules functions Changes to functions implementation physical-plan Changes to the physical-plan crate labels Jan 29, 2026

adriangb and others added 2 commits January 29, 2026 12:37

update tests

1fd19d2

github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Jan 29, 2026

adriangb commented Jan 29, 2026

View reviewed changes

wrap up logic into function

254b4af

adriangb marked this pull request as ready for review January 29, 2026 19:13

adriangb requested a review from alamb January 29, 2026 19:13

adriangb mentioned this pull request Jan 29, 2026

Allow struct field access projections to be pushed down into scans #19538

Open

update aggregate slt

6c372d9

adriangb commented Jan 29, 2026

View reviewed changes

add comment about literals and cse:

01b8a46

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add ExpressionPlacement enum for optimizer expression placement decisions #20065

feat: add ExpressionPlacement enum for optimizer expression placement decisions #20065

adriangb commented Jan 29, 2026 •

edited

Loading

Uh oh!

adriangb Jan 29, 2026

Uh oh!

adriangb Jan 29, 2026

Uh oh!

adriangb Jan 29, 2026

Uh oh!

adriangb Jan 29, 2026

Uh oh!

adriangb commented Jan 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	let all_simple_exprs =
	self.projector
	.projection()
	.as_ref()
	.iter()
	.all(\|proj_expr\| {
	proj_expr.expr.as_any().is::<Column>()
	\|\| proj_expr.expr.as_any().is::<Literal>()
	});

	/// Checks if the given expression is trivial.
	/// An expression is considered trivial if it is either a `Column` or a `Literal`.
	fn is_expr_trivial(expr: &Arc<dyn PhysicalExpr>) -> bool {
	expr.as_any().downcast_ref::<Column>().is_some()
	\|\| expr.as_any().downcast_ref::<Literal>().is_some()
	}

	// Check whether `expr` is trivial; i.e. it doesn't imply any computation.
	fn is_expr_trivial(expr: &Expr) -> bool {
	matches!(expr, Expr::Column(_) \| Expr::Literal(_, _))
	}

feat: add ExpressionPlacement enum for optimizer expression placement decisions #20065

Are you sure you want to change the base?

feat: add ExpressionPlacement enum for optimizer expression placement decisions #20065

Conversation

adriangb commented Jan 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Test Plan

Uh oh!

adriangb Jan 29, 2026

Choose a reason for hiding this comment

Uh oh!

adriangb Jan 29, 2026

Choose a reason for hiding this comment

Uh oh!

adriangb Jan 29, 2026

Choose a reason for hiding this comment

Uh oh!

adriangb Jan 29, 2026

Choose a reason for hiding this comment

Uh oh!

adriangb commented Jan 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

adriangb commented Jan 29, 2026 •

edited

Loading