-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Description
Describe the bug
We're currently upgrading from datafusion 45 to datafusion 46 🙈 and came across an issue related to dictionary ids during protobuf serde.
Here's a small reproducer, and this seems to still be a problem on current datafusion:
The following test works on df-45 but fails on df-46 and all later releases, added to the bottom of datafusion/proto/tests/cases/roundtrip_physical_plan.rs:
#[test]
fn roundtrip_call_null_scalar_struct_dict() {
let data_type = DataType::Struct(Fields::from(vec![Field::new(
"item",
DataType::Dictionary(Box::new(DataType::UInt32), Box::new(DataType::Utf8)),
true,
)]));
let schema = Arc::new(Schema::new(Fields::from([Arc::new(Field::new(
"a",
data_type.clone(),
true,
))])));
let scan = Arc::new(EmptyExec::new(schema.clone()));
let scalar = lit(ScalarValue::try_from(data_type).unwrap());
let filter = Arc::new(
FilterExec::try_new(
Arc::new(BinaryExpr::new(
scalar,
datafusion::logical_expr::Operator::Eq,
col("a", &schema).unwrap(),
)),
scan,
)
.unwrap(),
);
roundtrip_test(filter).expect("roundtrip");
}In df46, this fails during deserialization of protobuf back to the physical plan:
thread 'cases::roundtrip_physical_plan::roundtrip_call_null_scalar_struct_dict' panicked at datafusion/proto/tests/cases/roundtrip_physical_plan.rs:140:10:
from proto: Plan("DataFusion error: ArrowError(SchemaError(\"Invalid data for schema. Field { name: \\\"item\\\", data_type: Dictionary(UInt32, Utf8), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} } refers to node not found in schema\"), None)")
In df52, it already fails during serialization, as there's no pre-allocated dict id available from the dictionary tracker
thread 'cases::roundtrip_physical_plan::roundtrip_call_null_scalar_struct_dict' (145817) panicked at datafusion/proto/tests/cases/roundtrip_physical_plan.rs:148:14:
to proto: Plan("General error: Error encoding ScalarValue::List as IPC: Ipc error: no dict id for field item")
I am a bit puzzled on how that is supposed to work. From what I can tell, the DictionaryTracker seems to be initialized empty and then never modified - I don't see a possible code path where a dict_id is even created.
To Reproduce
Run the test above within the context of datafusion/proto/tests/cases/roundtrip_physical_plan.rs
Expected behavior
Test passes, protobuf roundtrip of physical plan succeeds.
Additional context
I believe it started failing with work related to https://github.com/apache/datafusion/pull/14227/changes