Conversation
Co-authored-by: Cursor <cursoragent@cursor.com>
Pull request overview
This PR introduces a HIP VMem-based allocator path for Iris' symmetric heap, adds an as_symmetric() workflow for importing external PyTorch tensors via DMA-BUF into a controlled virtual address space, and adds an extensive test suite covering ROCr/HIP behaviors and VMem/DMA-BUF workflows.
Changes:
- Add VMemAllocator and extend SymmetricHeap/Iris to support allocator_type selection and external-tensor import (as_symmetric); a rough usage sketch follows this list.
- Add HIP VMem primitives + external-memory lifetime APIs (e.g., destroy_external_memory) and DMA-BUF offset-handling tests.
- Add many unit tests for segmented export/import, the cumulative mem_set_access workaround, peer exchange, and end-to-end imported-tensor usage; disable a flaky float32 matmul config.
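For orientation, a rough usage sketch of the allocator selection and as_symmetric() import described above. The iris.iris(...) construction call and its keyword names are assumptions based on the file summaries, not verified against the new public API:

```python
import torch
import iris

# Hypothetical construction call; heap_size and allocator_type keyword names
# are assumptions based on this PR's summary, not a verified signature.
ctx = iris.iris(heap_size=1 << 30, allocator_type="vmem")

# Import an externally allocated PyTorch tensor into the symmetric heap's
# controlled VA space via DMA-BUF so peers can address it symmetrically.
external = torch.zeros(1024, dtype=torch.float32, device="cuda")
symmetric_view = ctx.as_symmetric(external)
```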
Reviewed changes
Copilot reviewed 26 out of 27 changed files in this pull request and generated 21 comments.
| File | Description |
|---|---|
| tests/unittests/test_vmem_segmented_export.py | New test for segmented DMA-BUF export/import across multiple VMem allocations. |
| tests/unittests/test_vmem_peer_dmabuf_exchange.py | New tests for peer DMA-BUF exchange (single-rank simulation + multi-rank). |
| tests/unittests/test_vmem_offset_check.py | New multi-rank test checking imported-tensor offset symmetry. |
| tests/unittests/test_vmem_multi_alloc_export.py | New test documenting failure/limitation when exporting multi-allocation VA ranges as one DMA-BUF. |
| tests/unittests/test_vmem_minimal_export_with_ondemand.py | New tests for “minimal export + on-demand mapping” allocator strategy. |
| tests/unittests/test_vmem_imported_tensor_rma.py | New end-to-end tests using imported tensors with Triton kernels and Iris load/RMA. |
| tests/unittests/test_vmem_cumulative_access.py | New tests validating ROCm cumulative mem_set_access workaround. |
| tests/unittests/test_vmem_allocator.py | New unit tests for creating/using VMem allocator and importing tensors. |
| tests/unittests/test_rocr_behaviors.py | New tests codifying ROCr/DMA-BUF behaviors Iris relies on (base-allocation export, cumulative access). |
| tests/unittests/test_pytorch_import_mechanism.py | New tests for offset-preserving DMA-BUF import mechanism used by as_symmetric(). |
| tests/unittests/test_pytorch_dmabuf_export.py | New tests for exporting PyTorch allocations to DMA-BUF (including slices/dtypes). |
| tests/unittests/test_hip_vmem_primitives.py | New tests for low-level HIP VMem reserve/map/set_access/unmap primitives. |
| tests/unittests/test_hip_apis.py | New tests for HIP API helpers like get_address_range() across dtypes/shapes. |
| tests/unittests/test_dmabuf_vmem_import.py | New tests for importing DMA-BUF into reserved VMem VA ranges and mixing with native allocations. |
| tests/unittests/test_dmabuf_controlled_va_import.py | New tests for controlled-VA DMA-BUF import including offset preservation and cleanup. |
| tests/unittests/test_dmabuf_apis.py | Update DMA-BUF import tests to handle external-memory lifetime (destroy_external_memory) and new return type. |
| tests/ops/test_matmul_all_gather.py | Disable float32 test case due to AMD Triton backend issue. |
| iris/tensor_utils.py | New helper for wrapping raw GPU pointers via __cuda_array_interface__. |
| iris/symmetric_heap.py | Add allocator selection + new segmented peer-refresh workflow + as_symmetric() plumbing. |
| iris/iris.py | Add allocator_type to public API + as_symmetric() method + best-effort destructor cleanup. |
| iris/hip.py | Extend HIP bindings: external-memory destroy API, get_address_range, and full HIP VMem API surface. |
| iris/allocators/vmem_allocator.py | New VMem allocator implementation using HIP VMem + DMA-BUF import of external tensors. |
| iris/allocators/torch_allocator.py | Minor updates; retains FD exchange code paths for torch-backed allocator. |
| iris/allocators/base.py | Simplify allocator base interface; move multi-rank coordination responsibility into SymmetricHeap. |
| iris/allocators/__init__.py | Export VMemAllocator. |
| iris/__init__.py | Re-export tensor utils and tidy imports/exports. |
| .gitignore | Ignore resources/log outputs. |
Comments suppressed due to low confidence (1)
iris/iris.py:69
- The docstring says allocator_type default is 'torch', but the actual default argument is 'vmem'. Please align the documentation with the actual default (or change the default to match the documented behavior) to avoid surprising API changes for callers.
```python
from iris.symmetric_heap import SymmetricHeap
import numpy as np
import math
import torch
import logging

# Import logging functionality from the separate logging module
from .logging import logger

# Import tracing functionality
from .tracing import Tracing, TraceEvent, DeviceTracing  # noqa: F401 re-export for iris.TraceEvent


class Iris:
```
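One way the mismatch flagged above could be resolved, sketched with an assumed Iris.__init__ signature (only allocator_type comes from the comment; the other parameter and the heap-size default are placeholders):

```python
class Iris:
    # Sketch only: the real constructor takes more parameters than shown here.
    def __init__(self, heap_size: int = 1 << 30, allocator_type: str = "vmem"):
        """Create an Iris instance.

        Args:
            heap_size: Symmetric heap size in bytes (placeholder parameter).
            allocator_type: Symmetric-heap allocator to use, "vmem" (default)
                or "torch"; the documented default stays in sync with the
                actual default argument.
        """
        self.allocator_type = allocator_type
```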
```python
tensor = self.allocator.allocate(actual_elements, dtype, alignment)
tensor = tensor[:num_elements]
self.refresh_peer_access()
return tensor
```
refresh_peer_access() is called after every allocate(), and it performs distributed barriers + FD exchange even if no new peer mappings are needed. This can make small allocations extremely expensive. Consider refreshing only when new segments were added, and/or batching refresh at explicit sync points rather than every allocation.
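A minimal sketch of the suggested gating, assuming the allocator exposes get_allocation_segments() as referenced later in this review; the _refreshed_segments counter is an invented bookkeeping detail, and allocate() is simplified relative to the quoted code:

```python
class SymmetricHeap:
    def __init__(self, allocator):
        self.allocator = allocator
        self._refreshed_segments = 0  # segments already exchanged with peers

    def allocate(self, num_elements, dtype, alignment):
        tensor = self.allocator.allocate(num_elements, dtype, alignment)
        # Only pay for barriers + FD exchange when a new segment appeared.
        segments = self.allocator.get_allocation_segments()
        if len(segments) > self._refreshed_segments:
            self.refresh_peer_access()
            self._refreshed_segments = len(segments)
        return tensor

    def refresh_peer_access(self):
        # Placeholder for the existing distributed refresh logic.
        pass
```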
```python
base_va = mem_address_reserve(va_size, 0, granularity)
```
mem_address_reserve signature is (size, alignment=0, addr=0, flags=0). Passing (va_size, 0, granularity) sets addr=granularity, not alignment. Swap the arguments to use alignment=granularity (e.g., mem_address_reserve(va_size, granularity, 0)).
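For illustration, the corrected call under the signature quoted above (this is not one of the reviewer's suggested-change blocks, and whether the binding accepts keyword arguments is an assumption; the surrounding names come from the quoted test):

```python
# Positional form: granularity becomes the alignment, addr stays 0.
base_va = mem_address_reserve(va_size, granularity, 0)

# Keyword form, assuming the binding accepts keywords, to make intent explicit.
base_va = mem_address_reserve(va_size, alignment=granularity, addr=0)
```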
```python
alloc_size = 2 << 20
va_size = 4 << 20

base_va = mem_address_reserve(va_size, 0, granularity)
```
mem_address_reserve signature is (size, alignment=0, addr=0, flags=0). Passing (va_size, 0, granularity) sets addr=granularity, not alignment. Swap the arguments to use alignment=granularity (e.g., mem_address_reserve(va_size, granularity, 0)).
Suggested change:

```diff
- base_va = mem_address_reserve(va_size, 0, granularity)
+ base_va = mem_address_reserve(va_size, granularity, 0)
```
```python
granularity = get_allocation_granularity(device_id)
va_size = 4 << 20

base_va = mem_address_reserve(va_size, 0, granularity)
```
mem_address_reserve signature is (size, alignment=0, addr=0, flags=0). Passing (va_size, 0, granularity) sets addr=granularity, not alignment. Swap the arguments to use alignment=granularity (e.g., mem_address_reserve(va_size, granularity, 0)).
Suggested change:

```diff
- base_va = mem_address_reserve(va_size, 0, granularity)
+ base_va = mem_address_reserve(va_size, granularity, 0)
```
```python
granularity = get_allocation_granularity(device_id)
va_size = 8 << 20

base_va = mem_address_reserve(va_size, 0, granularity)
```
mem_address_reserve signature is (size, alignment=0, addr=0, flags=0). Passing (va_size, 0, granularity) sets addr=granularity, not alignment. Swap the arguments to use alignment=granularity (e.g., mem_address_reserve(va_size, granularity, 0)).
Suggested change:

```diff
- base_va = mem_address_reserve(va_size, 0, granularity)
+ base_va = mem_address_reserve(va_size, granularity, 0)
```
```python
with managed_fd(peer_handle):
    # Import peer's memory via DMA-BUF with proper offset correction
    # peer_heap is where their heap starts (what they want us to use)
    # peer_base is the base of their allocation buffer
    # peer_size is the size of their allocation buffer
    mapped_addr = import_dmabuf_handle(
        peer_handle,
        peer_size,  # Import the full base allocation
        peer_heap,  # Original heap pointer (for offset calculation)
        peer_base,  # Base of allocation (for offset calculation)
    )
    mapped_addr = import_dmabuf_handle(peer_handle, peer_size, peer_heap, peer_base)
    heap_bases_array[peer] = mapped_addr
```
import_dmabuf_handle() now returns (mapped_ptr, ext_mem_handle), but this code stores the tuple directly into heap_bases_array. Also, ext_mem_handle must be destroyed to avoid leaking external memory handles. Unpack the return value and ensure destroy_external_memory() is called during teardown (or store the handles for later cleanup).
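A sketch of the unpack-and-track pattern this comment describes. managed_fd, import_dmabuf_handle, destroy_external_memory, and heap_bases_array are the names from the quoted code; external_mem_handles is an assumed bookkeeping list, and imports are omitted because the module layout isn't shown here:

```python
with managed_fd(peer_handle):
    # import_dmabuf_handle() now returns (mapped_ptr, ext_mem_handle); unpack it.
    mapped_addr, ext_mem_handle = import_dmabuf_handle(
        peer_handle, peer_size, peer_heap, peer_base
    )
heap_bases_array[peer] = mapped_addr
# Keep the handle so the imported external memory can be released later
# (external_mem_handles is an assumed list owned by the heap).
external_mem_handles.append(ext_mem_handle)

# Later, during teardown / close():
for handle in external_mem_handles:
    destroy_external_memory(handle)
external_mem_handles.clear()
```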
```python
base_va = mem_address_reserve(va_size, 0, granularity)
assert base_va > 0
```
mem_address_reserve signature is (size, alignment=0, addr=0, flags=0). Passing (va_size, 0, granularity) sets addr=granularity, not alignment. Swap the arguments to use alignment=granularity (e.g., mem_address_reserve(va_size, granularity, 0)).
```python
granularity = get_allocation_granularity(device_id)
va_size = 4 << 20

base_va = mem_address_reserve(va_size, 0, granularity)
```
mem_address_reserve signature is (size, alignment=0, addr=0, flags=0). Passing (va_size, 0, granularity) sets addr=granularity, not alignment. Swap the arguments to use alignment=granularity (e.g., mem_address_reserve(va_size, granularity, 0)).
Suggested change:

```diff
- base_va = mem_address_reserve(va_size, 0, granularity)
+ base_va = mem_address_reserve(va_size, granularity, 0)
```
```python
cuda_array = CUDAArrayInterface(tensor_va, tensor_size, self.device)
tensor_bytes = torch.as_tensor(cuda_array, device=self.device)
imported_tensor = tensor_bytes.view(external_tensor.dtype).reshape(external_tensor.shape)
```
This import path assumes the external tensor is contiguous and recreates it via .view(dtype).reshape(shape), which will silently produce incorrect results for non-contiguous tensors (e.g., transposes/slices with strides). Either require external_tensor.is_contiguous() (and raise otherwise) or preserve strides/storage_offset when constructing the imported view.
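A minimal version of the guard option, using the names from the quoted code (preserving strides and storage_offset would be the more general fix):

```python
if not external_tensor.is_contiguous():
    raise ValueError(
        "as_symmetric() requires a contiguous tensor; "
        "call .contiguous() on the input before importing."
    )
cuda_array = CUDAArrayInterface(tensor_va, tensor_size, self.device)
tensor_bytes = torch.as_tensor(cuda_array, device=self.device)
imported_tensor = tensor_bytes.view(external_tensor.dtype).reshape(external_tensor.shape)
```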
```python
all_bases_arr = distributed_allgather(local_base_arr).reshape(self.num_ranks).astype(np.uint64)
self.heap_bases[self.cur_rank] = int(all_bases_arr[self.cur_rank])

if self.num_ranks == 1 or self.fd_conns is None:
    return

if not hasattr(self.allocator, "get_allocation_segments"):
```
In multi-rank runs with allocator_type='torch', this early return prevents populating heap_bases for peer ranks (they stay 0), which will break pointer translation/RMA. Either implement a TorchAllocator peer-import path here (FD exchange + import_dmabuf_handle) or raise/skip when allocator doesn't support peer access in SymmetricHeap.
Suggested change:

```diff
- all_bases_arr = distributed_allgather(local_base_arr).reshape(self.num_ranks).astype(np.uint64)
- self.heap_bases[self.cur_rank] = int(all_bases_arr[self.cur_rank])
- if self.num_ranks == 1 or self.fd_conns is None:
-     return
- if not hasattr(self.allocator, "get_allocation_segments"):
+ all_bases_arr = (
+     distributed_allgather(local_base_arr)
+     .reshape(self.num_ranks)
+     .astype(np.uint64)
+ )
+ # Populate heap_bases for all ranks based on the allgather result.
+ for rank in range(self.num_ranks):
+     self.heap_bases[rank] = int(all_bases_arr[rank])
+ if self.num_ranks == 1 or self.fd_conns is None:
+     return
+ if not hasattr(self.allocator, "get_allocation_segments"):
+     # In multi-rank runs with active FD connections, we require
+     # allocator support for segmented export/import to establish
+     # peer mappings. Failing silently here would leave peer access
+     # unusable even though heap_bases were populated.
+     if self.num_ranks > 1 and self.fd_conns is not None:
+         raise RuntimeError(
+             f"{type(self.allocator).__name__} does not support peer "
+             "access for multi-rank symmetric heaps. Use "
+             "allocator_type='vmem' to enable RMA/peer access."
+         )
```
Motivation
Technical Details
Test Plan
Test Result
Submission Checklist