Skip to content

Design: Reducing Unused File Groups/Partitions in Distributed Tasks #449

@shehab-ali

Description

@shehab-ali

Problem Statement

Today, distributed task execution may retain full scan partition/file-group metadata even when a given task only executes a subset. This creates avoidable overhead in:

  • Worker memory footprint (deserialized plan metadata)
  • Network payload size (plan transfer)
  • Planning/debugging complexity (large explain trees)

I want to go over those three options that I found that we can consider:

  1. Shared stage plan + outbound per-task payload trimming
  2. Distinct task-specialized physical plan per task
  3. Task-local scan specialization (specialize scan nodes to task-owned partitions)

Goals

  • Reduce memory and network overhead from unused file groups/partitions.
  • Preserve correctness and existing distributed semantics.
  • Maintain debuggability and operational clarity.

Background (Current Behavior)

PartitionIsolatorExec exposes a subset of partitions to upstream nodes but can still keep a child plan that owns all original partitions. In practice, this can isolate execution while still retaining full scan metadata. DistributedTaskContext already provides task_index/task_count, and existing work-unit feed orchestration uses task and partition indexing conventions.

This means we already have ownership logic, but not always metadata-level pruning.


Option 1: Shared Stage Plan + Outbound Per-Task Payload Trimming

Idea

Keep a single canonical stage plan in coordinator memory. When sending work to each worker task, generate a task-projected serialized payload where non-relevant file groups/partitions are removed.

How

  • Keep coordinator-side Stage.plan shared.
  • At task dispatch/serialization time, apply a projection pass:
    • For scan nodes, keep only task-owned partitions.
    • Remap local partition indexes where required.
  • Worker receives and deserializes trimmed plan payload.

Pros

  • Good win with moderate change scope.
  • Minimizes coordinator architectural churn.
  • Immediate network + worker-memory reduction.

Cons/Risks

  • Requires careful partition remapping contracts.
  • Some logic exists both in canonical plan and projection path.
  • Must keep feed/shuffle assumptions consistent.

Best Use

  • Near-term optimization with low-to-medium implementation risk.

Option 2: Distinct Task-Specialized Physical Plan per Task

Idea

Materialize N distinct physical plans for a stage (one per task), each already specialized to task-owned partitions and potentially task-specific sub-optimizations.

How

  • During stage planning, compute task ownership maps.
  • Clone/rewrite stage plan per task.
  • Task receives its dedicated plan artifact.

Pros

  • Cleanest runtime model (what task sees is exactly what it runs).
  • Maximum specialization headroom beyond scans.
  • Easier future locality or cost-aware per-task optimization.

Cons/Risks

  • Highest planning CPU/memory overhead on coordinator.
  • Broader code changes and maintenance surface.
  • More complex debugging and test matrix (N plan variants).

Best Use

  • Long-term architecture when maximum optimization flexibility is desired.

Option 3: Task-Local Scan Specialization

Idea

Specialize only scan nodes (e.g., DataSourceExec/format scans) so each task owns a pruned set of file groups/partitions. Keep most of the stage plan architecture unchanged.

How

  • Reuse existing partition ownership mapping (task_index, task_count).
  • At plan rewrite time, replace scan nodes with task-pruned equivalents.
  • Upstream operators run on compact local partition indexing.

Pros

  • High impact/effort ratio.
  • Attacks the largest overhead source directly (scan metadata).
  • Less invasive than full per-task plan architecture.

Cons/Risks

  • Requires scan-type-aware rewrites.
  • Must validate interactions with operators expecting specific partition counts.

Best Use

  • First major optimization step, independent or combined with Option 1.

Comparative Summary

Dimension Option 1: Shared+Trimmed Payload Option 2: Distinct Per-Task Plan Option 3: Task-Local Scan Specialization
Change scope Medium Large Small-Medium
Worker memory reduction High Highest High
Network payload reduction High High Medium-High
Coordinator planning cost Low-Medium High Low-Medium
Implementation risk Medium High Medium
Long-term optimization headroom Medium-High Highest Medium

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions