Skip to content

Conversation

@adriangb
Copy link
Contributor

@adriangb adriangb commented Jan 29, 2026

Summary

This PR is part of work towards #19387

Extracts the ExpressionPlacement enum from #20036 to provide a mechanism for expressions to indicate where they should be placed in the query plan for optimal execution.

I've opted to go the route of having expressions declare their behavior via a new API on enum Expr and trait PhysicalExpr:

enum Expr {
    pub fn placement(&self) -> ExpressionPlacement { ... }
   ...
}

And:

trait PhysicalExpr {
   fn placement(&self) -> ExpressionPlacement { ... }
}

Where ExpressionPlacement:

enum ExpressionPlacement {
    /// Argument is a literal constant value or an expression that can be
    /// evaluated to a constant at planning time.
    Literal,
    /// Argument is a simple column reference.
    Column,
    /// Argument is a complex expression that can be safely placed at leaf nodes.
    /// For example, if `get_field(struct_col, 'field_name')` is implemented as a
    /// leaf-pushable expression, then it would return this variant.
    /// Then `other_leaf_function(get_field(...), 42)` could also be classified as
    /// leaf-pushable using the knowledge that `get_field(...)` is leaf-pushable.
    PlaceAtLeaves,
    /// Argument is a complex expression that should be placed at root nodes.
    /// For example, `min(col1 + col2)` is not leaf-pushable because it requires per-row computation.
    PlaceAtRoot,
}

We arrived at ExprPlacement after iterating through a version that had:

enum ArgTriviality {
    Literal,
    Column,
    Trivial,
    NonTrivial,
}

This terminology came from existing concepts in the codebase that were sprinkled around various places in the logical and physical layers. Some examples:

let all_simple_exprs =
self.projector
.projection()
.as_ref()
.iter()
.all(|proj_expr| {
proj_expr.expr.as_any().is::<Column>()
|| proj_expr.expr.as_any().is::<Literal>()
});

/// Checks if the given expression is trivial.
/// An expression is considered trivial if it is either a `Column` or a `Literal`.
fn is_expr_trivial(expr: &Arc<dyn PhysicalExpr>) -> bool {
expr.as_any().downcast_ref::<Column>().is_some()
|| expr.as_any().downcast_ref::<Literal>().is_some()
}

// Check whether `expr` is trivial; i.e. it doesn't imply any computation.
fn is_expr_trivial(expr: &Expr) -> bool {
matches!(expr, Expr::Column(_) | Expr::Literal(_, _))
}

The new API adds the nuance / distinction of the case of get_field(col, 'a') where it is neither a column nor a literal but it is trivial.

It also gives scalar functions the ability to classify themselves.
This part was a bit tricky because ScalarUDFImpl (the scalar function trait that users implement) lives in datafuions-expr which cannot have references to datafusion-physical-expr-common (where PhysicalExpr is defined).
But once we are in the physical layer scalar functions are represented as func: ScalarUDFImpl, args: Vec<Arc<dyn PhysicalExpr>>.
And since we can't have a trait method referencing PhysicalExpr there would be no way to ask a function to classify itself in the physical layer.

Additionally even if we could refer to PhysicalExpr from the ScalarUDFImpl trait we would then need 2 methods with similar but divergent logic (match on the Expr enum in one, downcast to various known types in the physical version) that adds boilerplate for implementers.

The ExprPlacement enum solves this problem: we can have a single method ScalarUDFImpl::placement(args: &[ExpressionPlacement]).
The parent of ScalarUDFImpl will call either Expr::placement or PhysicalExpr::placement depending on which one it has.

Changes

  • Add ExpressionPlacement enum in datafusion-expr-common with four variants:

    • Literal - constant values
    • Column - simple column references
    • PlaceAtLeaves - cheap expressions (like get_field) that can be pushed to leaf nodes
    • PlaceAtRoot - expensive expressions that should stay at root
  • Add placement() method to:

    • Expr enum
    • ScalarUDF / ScalarUDFImpl traits (with default returning PlaceAtRoot)
    • PhysicalExpr trait (with default returning PlaceAtRoot)
    • Physical expression implementations for Column, Literal, and ScalarFunctionExpr
  • Implement placement() for GetFieldFunc that returns PlaceAtLeaves when accessing struct fields with literal keys

  • Replace is_expr_trivial() function checks with placement() checks in:

    • datafusion/optimizer/src/optimize_projections/mod.rs
    • datafusion/physical-plan/src/projection.rs

Test Plan

  • cargo check passes on all affected packages
  • cargo test -p datafusion-optimizer passes
  • cargo test -p datafusion-physical-plan passes (except unrelated zstd feature test)
  • cargo test -p datafusion-functions --lib getfield passes

🤖 Generated with Claude Code

… decisions

This extracts the ExpressionPlacement enum from PR apache#20036 to provide a
mechanism for expressions to indicate where they should be placed in
the query plan for optimal execution.

Changes:
- Add ExpressionPlacement enum with variants: Literal, Column,
  PlaceAtLeaves, PlaceAtRoot
- Add placement() method to Expr, ScalarUDF, ScalarUDFImpl traits
- Add placement() method to PhysicalExpr trait and implementations
- Implement placement() for GetFieldFunc to return PlaceAtLeaves when
  accessing struct fields with literal keys
- Replace is_expr_trivial() checks with placement() in optimizer and
  physical-plan projection code

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@adriangb adriangb self-assigned this Jan 29, 2026
@github-actions github-actions bot added logical-expr Logical plan and expressions physical-expr Changes to the physical-expr crates optimizer Optimizer rules functions Changes to functions implementation physical-plan Changes to the physical-plan crate labels Jan 29, 2026
adriangb and others added 2 commits January 29, 2026 12:37
Add tests for GetFieldFunc::placement() covering:
- Literal key access (leaf-pushable)
- Column key access (not leaf-pushable)
- PlaceAtRoot base expressions
- Edge cases (empty args, literal base)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Jan 29, 2026
02)--ProjectionExec: expr=[get_field(s@0, value) as __common_expr_1]
03)----FilterExec: id@0 > 2, projection=[s@1]
04)------DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/projection_pushdown/simple.parquet]]}, projection=[id, s], file_type=parquet, predicate=id@0 > 2, pruning_predicate=id_null_count@1 != row_count@2 AND id_max@0 > 2, required_guarantees=[]
01)ProjectionExec: expr=[get_field(s@0, value) + get_field(s@0, value) as doubled]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually correct / an improvement. We are saying that get_field is very cheap, so no need to deduplicate it. I added an extra test above that shows that a more complex expression (id + s['value']) will get dudplicated (as a whole) by the CSE optimizer.

03)----ProjectionExec: expr=[get_field(__unnest_placeholder(recursive_unnest_table.column3,depth=1)@0, c1) as __unnest_placeholder(UNNEST(recursive_unnest_table.column3)[c1]), column3@1 as column3]
04)------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
03)----RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
04)------ProjectionExec: expr=[get_field(__unnest_placeholder(recursive_unnest_table.column3,depth=1)@0, c1) as __unnest_placeholder(UNNEST(recursive_unnest_table.column3)[c1]), column3@1 as column3]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the projection is a get_field it can be pushed under the RepartitionExec. Because this is a MemorySourceConfig data source (which doesn't accept projections, it's pointless to do so) it doesn't get pushed into the scan. But this is still correct / a win: we reduce the size of the data very cheaply by pulling out the field we care above and discarding the rest before we slice up the data in the RepartitionExec.

Comment on lines -589 to -592
// Check whether `expr` is trivial; i.e. it doesn't imply any computation.
fn is_expr_trivial(expr: &Expr) -> bool {
matches!(expr, Expr::Column(_) | Expr::Literal(_, _))
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point of this PR is to formalize up these hidden assumptions about expressions and let expressions like get_field participate in the decision of how to treat the expression.

Comment on lines +7988 to +7990
01)ProjectionExec: expr=[count(Int64(1))@0 as count(), count(Int64(1))@0 as count(*)]
02)--ProjectionExec: expr=[2 as count(Int64(1))]
03)----PlaceholderRowExec
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also an improvement.

Previously we would brodcast the literal 2 to the number of rows twice (2 as count(), 2 as count(*) -> lit(2), lit(2). Now we first expand it once in an inner projection then reference that column with aliases twice (i.e. clone the pointer to the array).

In this case it's 1 row so it's not really meaningful, but is nonetheless better. I'll see if I can craft an example that shows this behavior with N rows.

@adriangb
Copy link
Contributor Author

@AdamGS I think this should be in a reviewable state 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

functions Changes to functions implementation logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Changes to the physical-expr crates physical-plan Changes to the physical-plan crate sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant