Skip to content

Physical plan proto roundtrip for null-valued scalar of a Struct(Dict) data type #20011

@dispanser

Description

@dispanser

Describe the bug

We're currently upgrading from datafusion 45 to datafusion 46 🙈 and came across an issue related to dictionary ids during protobuf serde.
Here's a small reproducer, and this seems to still be a problem on current datafusion:

The following test works on df-45 but fails on df-46 and all later releases, added to the bottom of datafusion/proto/tests/cases/roundtrip_physical_plan.rs:

#[test]
fn roundtrip_call_null_scalar_struct_dict() {
    let data_type = DataType::Struct(Fields::from(vec![Field::new(
        "item",
        DataType::Dictionary(Box::new(DataType::UInt32), Box::new(DataType::Utf8)),
        true,
    )]));
    let schema = Arc::new(Schema::new(Fields::from([Arc::new(Field::new(
        "a",
        data_type.clone(),
        true,
    ))])));
    let scan = Arc::new(EmptyExec::new(schema.clone()));
    let scalar = lit(ScalarValue::try_from(data_type).unwrap());
    let filter = Arc::new(
        FilterExec::try_new(
            Arc::new(BinaryExpr::new(
                scalar,
                datafusion::logical_expr::Operator::Eq,
                col("a", &schema).unwrap(),
            )),
            scan,
        )
        .unwrap(),
    );
    roundtrip_test(filter).expect("roundtrip");
}

In df46, this fails during deserialization of protobuf back to the physical plan:

thread 'cases::roundtrip_physical_plan::roundtrip_call_null_scalar_struct_dict' panicked at datafusion/proto/tests/cases/roundtrip_physical_plan.rs:140:10:
from proto: Plan("DataFusion error: ArrowError(SchemaError(\"Invalid data for schema. Field { name: \\\"item\\\", data_type: Dictionary(UInt32, Utf8), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} } refers to node not found in schema\"), None)")

In df52, it already fails during serialization, as there's no pre-allocated dict id available from the dictionary tracker

thread 'cases::roundtrip_physical_plan::roundtrip_call_null_scalar_struct_dict' (145817) panicked at datafusion/proto/tests/cases/roundtrip_physical_plan.rs:148:14:
to proto: Plan("General error: Error encoding ScalarValue::List as IPC: Ipc error: no dict id for field item")

I am a bit puzzled on how that is supposed to work. From what I can tell, the DictionaryTracker seems to be initialized empty and then never modified - I don't see a possible code path where a dict_id is even created.

To Reproduce

Run the test above within the context of datafusion/proto/tests/cases/roundtrip_physical_plan.rs

Expected behavior

Test passes, protobuf roundtrip of physical plan succeeds.

Additional context

I believe it started failing with work related to https://github.com/apache/datafusion/pull/14227/changes

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions