Skip to content

[IR Container] Phase 2 Copy-Move Semantics#5964

Draft
mdavis36 wants to merge 1 commit intomd/phase2-per-fusionfrom
md/phase2-copy-move
Draft

[IR Container] Phase 2 Copy-Move Semantics#5964
mdavis36 wants to merge 1 commit intomd/phase2-per-fusionfrom
md/phase2-copy-move

Conversation

@mdavis36
Copy link
Collaborator

Copy constructor now shares the source's container pointer instead of creating a new one. Fusion::copy clones directly from per-Fusion filtered vals rather than delegating to IrContainer::copy. Swap changed from content-based (IrContainer::swap) to pointer-based with per-Fusion ownership tracking for both same-container and different-container cases.

Copy constructor now shares the source's container pointer instead of
creating a new one. Fusion::copy clones directly from per-Fusion
filtered vals rather than delegating to IrContainer::copy. Swap changed
from content-based (IrContainer::swap) to pointer-based with per-Fusion
ownership tracking for both same-container and different-container cases.
@mdavis36
Copy link
Collaborator Author

!test

@github-actions
Copy link

github-actions bot commented Feb 12, 2026

Description

  • Complete rewrite of Fusion::swap to use pointer-based approach with per-Fusion ownership tracking

  • Modified Fusion::copy to clone directly from per-Fusion filtered vals instead of delegating to IrContainer::copy

  • Changed copy constructor to share source's container pointer instead of creating new container

  • Added self-assignment check to move assignment operator for safety

Changes walkthrough

Relevant files
Enhancement
fusion.cpp
Phase 2 Copy-Move Semantics Implementation                             

csrc/fusion.cpp

  • Completely rewrote Fusion::swap function with pointer-based approach
    and per-Fusion ownership tracking
  • Modified Fusion::copy to clone vals directly instead of delegating to
    IrContainer::copy
  • Changed copy constructor to share source's container pointer
  • Added self-assignment check to move assignment operator
  • +84/-36 

    PR Reviewer Guide

    Here are some key observations to aid the review process:

    🧪 No relevant tests
    🔒 No security concerns identified
    ⚡ Recommended focus areas for review
    Memory Safety

    The copy constructor now shares the source's container pointer (line 304), which could lead to dangling references if the source Fusion is destroyed while the copy still exists. This changes fundamental ownership semantics and requires careful review of lifetime management and container reference counting.

    Fusion::Fusion(const Fusion& other) : ir_container_(other.ir_container_) {
      FUSER_PERF_SCOPE("Fusion copy");
      ir_container_->addFusion(this);
      Fusion::copy(&other, this);
    }
    Complex Swap Logic

    The new swap implementation (lines 107-192) is significantly more complex with multiple phases of pointer updates and ownership transfers. The complexity increases the risk of bugs, especially in edge cases like self-swap or when containers are null. The logic for handling same-container vs different-container cases needs thorough validation.

    void Fusion::swap(Fusion& a, Fusion& b) noexcept {
      FUSER_PERF_SCOPE("Fusion swap");
    
      if (&a == &b) {
        return;
      }
    
      // Collect statements owned by each Fusion BEFORE swap
      std::vector<Val*> a_owned_vals, b_owned_vals;
      std::vector<Expr*> a_owned_exprs, b_owned_exprs;
    
      if (a.ir_container_) {
        const auto& av = a.ir_container_->valsOwnedBy(&a);
        const auto& ae = a.ir_container_->exprsOwnedBy(&a);
        a_owned_vals.assign(av.begin(), av.end());
        a_owned_exprs.assign(ae.begin(), ae.end());
      }
      if (b.ir_container_) {
        const auto& bv = b.ir_container_->valsOwnedBy(&b);
        const auto& be = b.ir_container_->exprsOwnedBy(&b);
        b_owned_vals.assign(bv.begin(), bv.end());
        b_owned_exprs.assign(be.begin(), be.end());
      }
    
      // Transfer Fusion registrations between containers before pointer swap.
      // After swap, a will own b's container and b will own a's container.
      if (a.ir_container_ && b.ir_container_ &&
          a.ir_container_.get() != b.ir_container_.get()) {
        a.ir_container_->transferFusion(&a, &b);
        b.ir_container_->transferFusion(&b, &a);
      }
    
      // Swap container pointers
      std::swap(a.ir_container_, b.ir_container_);
    
      // Swap all Fusion-level members
      std::swap(a.inputs_, b.inputs_);
      std::swap(a.outputs_, b.outputs_);
      std::swap(a.io_alias_, b.io_alias_);
      std::swap(a.all_tv_uses_valid_, b.all_tv_uses_valid_);
      std::swap(a.is_during_update_uses_, b.is_during_update_uses_);
      std::swap(a.managed_data_, b.managed_data_);
      std::swap(a.managed_named_data_, b.managed_named_data_);
      std::swap(a.expected_dynamic_smem_bytes_, b.expected_dynamic_smem_bytes_);
      std::swap(a.all_tvs_ptr_, b.all_tvs_ptr_);
      std::swap(a.zero_val_, b.zero_val_);
      std::swap(a.one_val_, b.one_val_);
      std::swap(a.true_val_, b.true_val_);
      std::swap(a.false_val_, b.false_val_);
      std::swap(a.magic_zero_val_, b.magic_zero_val_);
      std::swap(a.axioms_, b.axioms_);
      std::swap(a.metadata_, b.metadata_);
    
      // Update Statement::ir_container_ pointers: a's old statements now belong
      // to b, and b's old statements now belong to a
      for (auto* val : a_owned_vals) {
        val->ir_container_ = &b;
      }
      for (auto* expr : a_owned_exprs) {
        expr->ir_container_ = &b;
      }
      for (auto* val : b_owned_vals) {
        val->ir_container_ = &a;
      }
      for (auto* expr : b_owned_exprs) {
        expr->ir_container_ = &a;
      }
    
      // Update per-Fusion tracking keys in containers
      if (a.ir_container_ && b.ir_container_) {
        if (a.ir_container_.get() == b.ir_container_.get()) {
          // Same container: directly swap per-Fusion tracking entries
          auto* c = a.ir_container_.get();
          std::swap(c->per_fusion_vals_[&a], c->per_fusion_vals_[&b]);
          std::swap(c->per_fusion_exprs_[&a], c->per_fusion_exprs_[&b]);
        } else {
          // Different containers: rename tracking keys to match new owners
          a.ir_container_->transferStatementOwnership(&b, &a);
          b.ir_container_->transferStatementOwnership(&a, &b);
        }
      } else if (a.ir_container_) {
        a.ir_container_->transferStatementOwnership(&b, &a);
      } else if (b.ir_container_) {
        b.ir_container_->transferStatementOwnership(&a, &b);
      }
    }
    Missing Tests

    No test changes are visible in the diff despite fundamental changes to copy and swap semantics. These are core operations that affect memory management and object lifecycle. Comprehensive tests should be added to verify correct behavior, especially for edge cases like self-assignment, container sharing, and ownership transfers.

    IrCloner Fusion::copy(const Fusion* from, Fusion* to) {
      to->clear();
    
      IrCloner ir_cloner(to);
    
      // Clone from's vals in insertion order
      for (auto val : from->deterministic_vals()) {
        ir_cloner.clone(val);
      }
    
      // Wire up definitions and uses on cloned vals
      for (auto val : from->vals()) {
        ir_cloner.clone(val)->setDefinition(ir_cloner.clone(val->definition_));
        ir_cloner.clone(val)->setUses(ir_cloner.clone(val->uses_));
      }
    
      // Remap cached special val pointers
      if (from->zero_val_) {
        to->zero_val_ = ir_cloner.clone(from->zero_val_);
      }
      if (from->one_val_) {
        to->one_val_ = ir_cloner.clone(from->one_val_);
      }
      if (from->true_val_) {
        to->true_val_ = ir_cloner.clone(from->true_val_);
      }
      if (from->false_val_) {
        to->false_val_ = ir_cloner.clone(from->false_val_);
      }
      if (from->magic_zero_val_) {
        to->magic_zero_val_ =
            ir_cloner.clone(from->magic_zero_val_)->as<NamedScalar>();
      }
    
      to->inputs_ = ir_cloner.clone(from->inputs_);
      to->outputs_ = ir_cloner.clone(from->outputs_);
      for (auto inp : to->inputs_) {
        inp->setIsFusionInput(true);
      }
      for (auto out : to->outputs_) {
        out->setIsFusionOutput(true);
      }
    
      for (Val* out : from->outputs_) {
        const AliasInfo& alias = from->io_alias_.get(out);
        if (alias.type == AllocationType::New) {
          continue;
        }
    
        Val* copied_out = ir_cloner.clone(out);
        Val* copied_in = ir_cloner.clone(alias.aliased_io);
        to->io_alias_.add(copied_out, copied_in, alias.type, alias.visibility);
      }
    
      to->all_tv_uses_valid_ = from->all_tv_uses_valid_;
      to->is_during_update_uses_ = from->is_during_update_uses_;
    
      for (const auto& i : from->managed_data_) {
        if (i.first.has_value()) {
          to->managed_data_.emplace_back(i.second(ir_cloner, i.first), i.second);
        } else {
          to->managed_data_.emplace_back(i.first, i.second);
        }
      }

    Test failures

    • (High, 195) CUDA InvalidAddressSpace errors in nvFuser direct test_repro suite

      Test Name A100 GB200 H100 Source
      tests.python.direct.test_repro.test_issue1246[nvfuser_direct_test=eager]
      tests.python.direct.test_repro.test_issue1270[nvfuser_direct_test=eager]
      tests.python.direct.test_repro.test_issue1270[nvfuser_direct_test=lru_cache]
      tests.python.direct.test_repro.test_issue1273[nvfuser_direct_test=eager]
      tests.python.direct.test_repro.test_issue1273[nvfuser_direct_test=lru_cache]
      tests.python.direct.test_repro.test_issue1277[nvfuser_direct_test=eager]
      tests.python.direct.test_repro.test_issue1277[nvfuser_direct_test=lru_cache]
      tests.python.direct.test_repro.test_issue1279[nvfuser_direct_test=eager]
      tests.python.direct.test_repro.test_issue1279[nvfuser_direct_test=lru_cache]
      tests.python.direct.test_repro.test_issue1310[nvfuser_direct_test=eager]
      ... with 57 more test failures omitted. Check internal logs.
    • (High, 66) CUDA uniform_ random kernel ‘InvalidAddressSpace’ errors across NVFuser repro tests

      Test Name A100 GB200 H100 Source
      tests.python.direct.test_repro.test_ca_map_concrete_loop_id[nvfuser_direct_test=eager]
      tests.python.direct.test_repro.test_ca_map_concrete_loop_id[nvfuser_direct_test=lru_cache]
      tests.python.direct.test_repro.test_domain_map_hang[nvfuser_direct_test=eager]
      tests.python.direct.test_repro.test_domain_map_hang[nvfuser_direct_test=lru_cache]
      tests.python.direct.test_repro.test_issue3292[nvfuser_direct_test=eager]
      tests.python.direct.test_repro.test_issue3292[nvfuser_direct_test=lru_cache]
      tests.python.direct.test_repro.test_issue3369[nvfuser_direct_test=eager]
      tests.python.direct.test_repro.test_issue3369[nvfuser_direct_test=lru_cache]
      tests.python.direct.test_repro.test_issue5377[nvfuser_direct_test=eager]
      tests.python.direct.test_repro.test_issue5377[nvfuser_direct_test=lru_cache]
      ... with 13 more test failures omitted. Check internal logs.
    • (High, 3) NVFuser CUDA_ERROR_INVALID_ADDRESS_SPACE in tests.python.direct.test_repro.test_issue1246

      Test Name A100 GB200 H100 Source
      tests.python.direct.test_repro.test_issue1246[nvfuser_direct_test=lru_cache]
    • (High, 1) CUDA unavailable during Llama4MoE initialization in tests.python.test_moe on dlcluster_h100

      Test Name H100 Source
      tests.python.test_moe.test_llama4_moe_thunderfx
    • (Medium, 137) NVFuser internal assertion/validation failures (alias_memory.cpp:835 & validator_utils.cpp) across multiple nvFuser test suites

      Test Name A100 GB200 H100 Source
      DynamicTransformTest.DynamicSqueezeTrivialReduction Link
      DynamicTransformTest.DynamicSqueezeTrivialWelford Link
      DynamicTransformTest.DynamicTransform3 Link
      DynamicTransformTest.DynamicTransformFusionExecutorCache Link
      DynamicTransformTest.DynamicTransformIssue418 Link
      Gpu2Test.FusionBNRepro2_CUDA Link
      Gpu2Test.FusionVarMean_CUDA Link
      InsertReshardingTest.Execute/1 Link
      MoveRepeatForwardTest.MoveOverRotation Link
      MoveRepeatForwardTest.Simple Link
      ... with 40 more test failures omitted. Check internal logs.
    • (Medium, 61) NVFuser codegen errors (duplicate symbols / alias_memory asserts) in multiple python/direct and multidevice test modules

      Test Name A100 A100 (dist.) GB200 GB200 (dist.) H100 H100 (dist.) Source
      tests.python.direct.test_python_frontend.test_output_stride_order_with_reduction[nvfuser_direct_test=eager]
      tests.python.direct.test_python_frontend.test_output_stride_order_with_reduction[nvfuser_direct_test=lru_cache]
      tests.python.multidevice.test_communication.test_allreduce
      tests.python.multidevice.test_communication.test_reduce_scatter_noncontiguous
      tests.python.multidevice.test_matmul.test_linear_reduce_scatter
      tests.python.multidevice.test_matmul.test_row_parallel_linear_with_bias
      tests.python.multidevice.test_multidevice.test_privatize_squeeze
      tests.python.opinfo.test_direct_ops.test_correctness_var_mean_float32
      thunder.tests.test_grad.test_phantom_grad_vs_torch_consistency_softmin_nvfuser_cuda_thunder.dtypes.bfloat16
      thunder.tests.test_grad.test_phantom_grad_vs_torch_consistency_softmin_nvfuser_cuda_thunder.dtypes.float16
      ... with 9 more test failures omitted. Check internal logs.
    • (Medium, 12) NVFuser codegen duplicate symbol compile errors across Alias, DynamicTransform, InsertResharding, Reshape test suites

      Test Name A100 GB200 H100 Source
      AliasTest.Bookend_Issue2375 Link
      DynamicTransformTest.FusionDynamicReshapeReductionShmoo Link
      Gpu2Test.FusionVarMean_CUDA Link
      InsertReshardingTest.Execute/3 Link
      ReshapeTest.ReductionFlatten1 Link
    • (Medium, 8) nvFuser alias_memory internal assertion failures in MatmulNodeParameterizedTest on A100

      Test Name A100 Source
      ReductionAxisIsOne/MatmulNodeParameterizedTest.MatmulNodeConcrete/11 Link
      ReductionAxisIsOne/MatmulNodeParameterizedTest.MatmulNodeConcrete/15 Link
      ReductionAxisIsOne/MatmulNodeParameterizedTest.MatmulNodeConcrete/17 Link
      ReductionAxisIsOne/MatmulNodeParameterizedTest.MatmulNodeSymbolic/12 Link
      ReductionAxisIsOne/MatmulNodeParameterizedTest.MatmulNodeSymbolic/13 Link
      ReductionAxisIsOne/MatmulNodeParameterizedTest.MatmulNodeSymbolic/14 Link
      ReductionAxisIsOne/MatmulNodeParameterizedTest.MatmulNodeSymbolic/19 Link
      ReductionAxisIsOne/MatmulNodeParameterizedTest.MatmulNodeSymbolic/3 Link
    • (Medium, 8) nvFuser multi-device communication result mismatches (reduce_scatter_noncontiguous & insert_resharding_after tests)

      Test Name A100 A100 (dist.) GB200 GB200 (dist.) H100 (dist.) Source
      tests.python.multidevice.test_communication.test_reduce_scatter_noncontiguous
      tests.python.multidevice.test_multidevice.test_insert_resharding_after
    • (Medium, 6) Phantom grad vs torch consistency mismatches in Thunder NVFuser (bf16/fp16)

      Test Name A100 GB200 H100 Source
      thunder.tests.test_grad.test_phantom_grad_vs_torch_consistency_outer_nvfuser_cuda_thunder.dtypes.bfloat16
      thunder.tests.test_grad.test_phantom_grad_vs_torch_consistency_outer_nvfuser_cuda_thunder.dtypes.float16
    • (Medium, 3) Thunder nvFuser VJP correctness mismatch for take (float64) on CUDA

      Test Name A100 GB200 H100 Source
      thunder.tests.test_grad.test_vjp_correctness_take_nvfuser_cuda_thunder.dtypes.float64
    • (Medium, 3) NVFuser IndexingTest.Reshape inline-expression mismatch across multiple GPU runners

      Test Name A100 GB200 H100 Source
      IndexingTest.Reshape Link

    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

    Labels

    None yet

    Projects

    None yet

    Development

    Successfully merging this pull request may close these issues.

    1 participant