Register CometArrowAllocator as a Spark MemoryConsumer for JVM-UDF dispatch #4174

@andygrove

Describe the problem

CometUdfBridge.evaluate (common/src/main/java/org/apache/comet/udf/CometUdfBridge.java, on the JVM-scalar-UDF prototype branch) allocates output Arrow vectors via the project-wide CometArrowAllocator. That allocator is a RootAllocator that is not registered with Spark's TaskMemoryManager, so off-heap memory consumed by the UDF dispatch path is invisible to Spark's task memory accounting and back-pressure machinery.

Under workloads with many concurrent JVM-UDF tasks per executor, this can drive native off-heap usage past the operator-level limits Spark would otherwise enforce.

Describe the potential solution

Either:

  1. Register CometArrowAllocator as a MemoryConsumer in Spark's TaskMemoryManager so allocations and frees update the task's accounting.
  2. Allocate UDF output vectors from a child allocator that is itself registered as a per-task consumer, so leakage and accounting stay scoped to the task.

Option (2) is closer to the existing Spark-Arrow integration pattern.
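
To make the accounting idea behind option (2) concrete, here is a minimal, dependency-free sketch. It does not use the real Spark or Arrow classes. `TaskBudget`, `TaskScopedAllocator`, and `UdfAccountingSketch` are hypothetical names invented for illustration: `TaskBudget` stands in for the role of Spark's TaskMemoryManager, and `TaskScopedAllocator` for a per-task child allocator registered as a consumer. A real implementation would presumably route these calls through Spark's MemoryConsumer and an Arrow allocation listener instead; the prototype's actual wiring is not shown in this issue.

```java
/** Hypothetical stand-in for a per-task memory budget (the role TaskMemoryManager plays). */
final class TaskBudget {
    private final long limit;
    private long used;

    TaskBudget(long limit) {
        this.limit = limit;
    }

    /** Grants up to {@code size} bytes and returns the number of bytes actually granted. */
    synchronized long acquire(long size) {
        long granted = Math.min(size, limit - used);
        used += granted;
        return granted;
    }

    synchronized void release(long size) {
        used -= size;
    }

    synchronized long used() {
        return used;
    }
}

/** Hypothetical per-task allocator: every reservation and free is reported to the budget. */
final class TaskScopedAllocator implements AutoCloseable {
    private final TaskBudget budget;
    private long outstanding;

    TaskScopedAllocator(TaskBudget budget) {
        this.budget = budget;
    }

    /** Reserves {@code size} bytes, or fails without holding a partial grant. */
    void reserve(long size) {
        long granted = budget.acquire(size);
        if (granted < size) {
            budget.release(granted); // hand back the partial grant
            throw new IllegalStateException("task budget exhausted, requested " + size);
        }
        outstanding += size;
    }

    void free(long size) {
        if (size > outstanding) {
            throw new IllegalArgumentException("freeing more than was reserved");
        }
        outstanding -= size;
        budget.release(size);
    }

    /** Returns anything still outstanding to the budget and reports it as a leak. */
    @Override
    public void close() {
        long leaked = outstanding;
        outstanding = 0;
        budget.release(leaked);
        if (leaked != 0) {
            throw new IllegalStateException("leaked " + leaked + " bytes at task end");
        }
    }
}

final class UdfAccountingSketch {
    public static void main(String[] args) {
        TaskBudget budget = new TaskBudget(1L << 20); // 1 MiB budget for the task
        try (TaskScopedAllocator allocator = new TaskScopedAllocator(budget)) {
            allocator.reserve(64 * 1024); // UDF output vector
            System.out.println("in use during UDF call: " + budget.used());
            allocator.free(64 * 1024);
        }
        System.out.println("in use after task: " + budget.used());
    }
}
```

Because the allocator is created and closed per task, a leak surfaces at task end and is attributed to that task rather than accumulating in a shared root allocator. That is the scoping property option (2) is after.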

Additional context

Identified during code review of the JVM-scalar-UDF prototype. Filed as a follow-up so the prototype PR can ship without a Spark-integration redesign.

Labels: area:expressions (Expression evaluation), area:ffi (Arrow FFI / JNI boundary), priority:medium (Functional bugs, performance regressions, broken features)
