Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
23 changes: 10 additions & 13 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -76,7 +76,7 @@ Currently, we support the following storage backends:

- SimpleStorage: A basic CPU memory storage with minimal data format constraints and easy usability.
- [Yuanrong](https://gitee.com/openeuler/yuanrong-datasystem) (beta, [#PR107](https://github.com/TransferQueue/TransferQueue/pull/107), [#PR96](https://github.com/TransferQueue/TransferQueue/pull/96)): An Ascend native data system that provides hierarchical storage interfaces including HBM/DRAM/SSD.
- [MooncakeStore](https://github.com/kvcache-ai/Mooncake) (alpha, [#PR162](https://github.com/TransferQueue/TransferQueue/pull/162)): A high-performance, KV-based hierarchical storage that supports RDMA transport between GPU and DRAM.
- [MooncakeStore](https://github.com/kvcache-ai/Mooncake) (beta, [#PR162](https://github.com/TransferQueue/TransferQueue/pull/162)): A high-performance, KV-based hierarchical storage that supports RDMA transport between GPU and DRAM.
- [RayRDT](https://docs.ray.io/en/master/ray-core/direct-transport.html) (alpha, [#PR167](https://github.com/TransferQueue/TransferQueue/pull/167)): Ray's new feature that allows Ray to store and pass objects directly between Ray actors.

Among them, `SimpleStorageUnit` serves as our default storage backend, coordinated by the `AsyncSimpleStorageManager` class. Each storage unit can be deployed on a separate node, allowing for distributed data management.
Expand Down Expand Up @@ -121,6 +121,8 @@ To simplify the usage of TransferQueue, we have provided a Redis-style high-leve
- **Metadata Tags**: Lightweight metadata for status tracking
- **Pluggable Backends**: Supports multiple backends

Refer to [tutorials/basic.ipynb](https://github.com/Ascend/TransferQueue/blob/main/tutorial/basic.ipynb) and [tutorials/02_kv_interface.py](https://github.com/Ascend/TransferQueue/blob/main/tutorial/02_kv_interface.py) for detailed usage examples.

#### StreamingDataLoader API

Designed as a drop-in replacement for the standard PyTorch `DataLoader`, this API allows each rank to automatically consume data without single-controller intervention.
Expand All @@ -147,17 +149,12 @@ Developers can leverage `TransferQueueClient` directly to implement advanced fea
#### verl
The primary motivation for integrating TransferQueue to verl now is to **alleviate the data transfer bottleneck of the single controller `RayPPOTrainer`**. Currently, all `DataProto` objects must be routed through `RayPPOTrainer`, resulting in a single point bottleneck of the whole post-training system.

![verl_dataflow_DataProto](https://github.com/TransferQueue/community_doc/blob/main/docs/verl_workflow.jpeg?raw=true)

Leveraging TransferQueue, we separate experience data transfer from metadata dispatch by

- Replacing `DataProto` with `BatchMeta` (metadata) and `TensorDict` (actual data) structures
- Preserving verl's original Dispatch/Collect logic via BatchMeta (maintaining single-controller debuggability)
- Accelerating data transfer by TransferQueue's distributed storage units
<p align="center">
<img src="https://raw.githubusercontent.com/wuxibin89/verl/refs/heads/wuxibin/doc_images/docs/_static/transfer_queue.png" width="100%">

Copilot AI Apr 1, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The image URL uses raw.githubusercontent.com/.../refs/heads/..., which typically does not resolve because the raw URL expects a git ref/branch name (e.g. .../main/... or .../<branch>/...), not the full refs/heads/... path. This will likely render as a broken image in the README; consider switching to a standard raw URL (with the actual branch name) or using a github.com/.../blob/...?...raw=true link (preferably in an official/stable repo).

Suggested change
<img src="https://raw.githubusercontent.com/wuxibin89/verl/refs/heads/wuxibin/doc_images/docs/_static/transfer_queue.png" width="100%">
<img src="https://raw.githubusercontent.com/wuxibin89/verl/wuxibin/doc_images/docs/_static/transfer_queue.png" width="100%">

Copilot uses AI. Check for mistakes.
</p>

![verl_dataflow_TransferQueue](https://github.com/TransferQueue/community_doc/blob/main/docs/verl_workflow_with_tq.jpeg?raw=true)
Official integration to verl is available at [verl/pulls/5401](https://github.com/verl-project/verl/pull/5401), with design doc at [[RFC] PPOTrainer with TransferQueue Integration](https://github.com/verl-project/verl/issues/5400). You may also refer to our [recipe](https://github.com/Ascend/TransferQueue/blob/main/recipe/simple_use_case/single_controller_demo.py), where we mimic the verl usage in a high-level manner.

You may refer to the [recipe](https://github.com/Ascend/TransferQueue/tree/dev/recipe/simple_use_case), where we mimic the verl usage in both async & sync scenarios. Official integration to verl is also available now at [verl/pulls/3649](https://github.com/volcengine/verl/pull/3649) (with subsequent PRs to further optimize the integration).

### Disaggregated Example

Expand Down Expand Up @@ -216,11 +213,11 @@ pip install TransferQueue
<img src="https://github.com/TransferQueue/community_doc/blob/main/docs/performance_0.1.1.dev2.png?raw=true" width="100%">
</p>

> Note: The above benchmark for TransferQueue is based on our naive `SimpleStorage` backend. By introducing high-performance storage backends and optimizing serialization/deserialization, we expect to achieve even better performance. Warmly welcome contributions from the community!
> Note: Optimization for MooncakeStore and other backends are still in process. Warmly welcome contributions from the community!

For detailed performance benchmarks, please refer to [this blog](https://www.yuque.com/haomingzi-lfse7/hlx5g0/tml8ke0zkgn6roey?singleDoc#).
For detailed performance benchmarks, please refer to [this blog](https://www.yuque.com/haomingzi-lfse7/lhp4el/tml8ke0zkgn6roey?singleDoc#).

We also provide a [stress test report](https://www.yuque.com/haomingzi-lfse7/hlx5g0/ydbwgo5k2umaag78?singleDoc#) that demonstrates **768 concurrent clients writing 1.4 TB of data** into TransferQueue across 4 nodes. The system remains stable without any crashes or data loss, achieving 80% bandwidth.
We also provide a [stress test report](https://www.yuque.com/haomingzi-lfse7/lhp4el/mt0vedqy7c337pgg?singleDoc#) that demonstrates more than **8192 concurrent clients writing 2 TB of data** into TransferQueue across 4 nodes. The system remains stable without any crashes or data loss.

<h2 id="customize"> 🛠️ Customize TransferQueue</h2>

Expand Down
2 changes: 1 addition & 1 deletion pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -118,7 +118,7 @@ yuanrong = [
"openyuanrong-datasystem"
]
mooncake = [
"mooncake-transfer-engine"
"mooncake-transfer-engine==0.3.10.post1"

Copilot AI Apr 1, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pinning mooncake-transfer-engine to an exact version in an optional extra can make dependency resolution brittle for downstream users (e.g., if they already depend on a compatible patch release). Unless you specifically require only 0.3.10.post1, consider using a compatible range (e.g. >=0.3.10.post1,<0.4) or ~=0.3.10 to allow patch-level upgrades while still enforcing a minimum version.

Suggested change
"mooncake-transfer-engine==0.3.10.post1"
"mooncake-transfer-engine>=0.3.10.post1,<0.4"

Copilot uses AI. Check for mistakes.
]

# If you need to mimic `package_dir={'': '.'}`:
Expand Down
68 changes: 57 additions & 11 deletions scripts/performance_test/README_PERFTEST.md
Original file line number Diff line number Diff line change
Expand Up @@ -42,19 +42,63 @@ python perftest.py \
| `--head_node_ip` | Head node IP address | - | Yes |
| `--worker_node_ip` | Worker node IP address (required for Yuanrong) | None | No |
| `--output_csv` | Path to output CSV file | None | No |
| `--use_complex_case` | Use complex test case with nested tensors and NonTensorStack fields | False | No |

## Backend Configuration

The script reads the backend configuration directly from the provided `--backend_config` YAML file. The backend type is determined by `backend.storage_backend` in the config file. When `--backend` is specified, it overrides the value in the config.

For device support of each backend:
- `SimpleStorage`: `cpu`
- `Yuanrong`: `cpu`, `npu`
- `MooncakeStore`: `cpu`, `gpu`
### SimpleStorage Configuration

## Test Data Format
```yaml
backend:
storage_backend: SimpleStorage
SimpleStorage:
total_storage_size: 100000
num_data_storage_units: 16
```

### Yuanrong Configuration

```yaml
backend:
storage_backend: Yuanrong
Yuanrong:
port: 31501
enable_yr_npu_transport: true
```

For Yuanrong backend, writer runs on the head node and reader runs on the worker node. `--worker_node_ip` is required.

### MooncakeStore Configuration

```yaml
backend:
storage_backend: MooncakeStore
MooncakeStore:
auto_init: true
metadata_server: localhost:50050
master_server_address: localhost:50051
local_hostname: ""
protocol: rdma
global_segment_size: 86294967296
local_buffer_size: 86294967296
device_name: ""
```

## Test Scenarios

### Simple Test Case (Default)

When `--use_complex_case` is **not** specified (default), the test creates a `TensorDict` with only regular tensors:

The test case creates a `TensorDict` with three types of fields to simulate real training batches:
- **Regular tensors**: Shape `(batch_size, seq_length)`, float32.

Each regular tensor field size = `batch_size × seq_length × 4` bytes.

### Complex Test Case

When `--use_complex_case` is specified, the test creates a `TensorDict` with three types of fields to simulate real training batches:

1. **Regular tensors**: Shape `(batch_size, seq_length)`, float32.
2. **Nested tensors** (non-NPU devices): Variable-length ragged sequences with lengths forming an arithmetic progression from 1 to `seq_length`. Average length ≈ `seq_length / 2`, so each nested field is roughly half the size of a regular field.
Expand All @@ -73,10 +117,6 @@ Each iteration performs a PUT → LIST → GET → DELETE cycle via TransferQueu

The test runs `--num_test_iterations` iterations. Data creation only happens in the first iteration; subsequent iterations reuse the same TensorDict to isolate transfer overhead.

## Yuanrong Backend

For Yuanrong backend, writer runs on the head node and reader runs on the worker node. `--worker_node_ip` is required.

## Running Full Test Suite

The `run_perf_test.sh` script automates the full test suite across all backends and data sizes, then generates a comparison chart:
Expand Down Expand Up @@ -130,12 +170,18 @@ After running the tests, `draw_figure.py` reads all CSV files from `results/` an

## Examples

### SimpleStorage backend
### SimpleStorage backend (simple case)
```bash
python perftest.py --backend_config=perftest_config.yaml --backend=SimpleStorage \
--head_node_ip=192.168.0.1
```

### SimpleStorage backend (complex case)
```bash
python perftest.py --backend_config=perftest_config.yaml --backend=SimpleStorage \
--head_node_ip=192.168.0.1 --use_complex_case
```

### Yuanrong backend (inter-node)
```bash
python perftest.py --backend_config=perftest_config.yaml --backend=Yuanrong \
Expand Down
2 changes: 1 addition & 1 deletion transfer_queue/version/version
Original file line number Diff line number Diff line change
@@ -1 +1 @@
0.1.6.dev0
0.1.6

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
0.1.6
0.1.7.dev

We recommend creating the release/v0.1.6 branch and advancing the version on the main branch to 0.1.7.dev.

@0oshowero0 0oshowero0 Apr 1, 2026

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 0.1.6 is not published yet. I plan to change it to 0.1.6 first, make release branch, then create a new PR that changes the main branch to 0.1.7.dev

Loading