Ascend · 0oshowero0 · Apr 2, 2026 · Mar 30, 2026 · Mar 30, 2026 · Apr 1, 2026
diff --git a/README.md b/README.md
@@ -76,7 +76,7 @@ Currently, we support the following storage backends:
 
 - SimpleStorage: A basic CPU memory storage with minimal data format constraints and easy usability.
 - [Yuanrong](https://gitee.com/openeuler/yuanrong-datasystem) (beta, [#PR107](https://github.com/TransferQueue/TransferQueue/pull/107), [#PR96](https://github.com/TransferQueue/TransferQueue/pull/96)): An Ascend native data system that provides hierarchical storage interfaces including HBM/DRAM/SSD.
-- [MooncakeStore](https://github.com/kvcache-ai/Mooncake) (alpha, [#PR162](https://github.com/TransferQueue/TransferQueue/pull/162)): A high-performance, KV-based hierarchical storage that supports RDMA transport between GPU and DRAM.
+- [MooncakeStore](https://github.com/kvcache-ai/Mooncake) (beta, [#PR162](https://github.com/TransferQueue/TransferQueue/pull/162)): A high-performance, KV-based hierarchical storage that supports RDMA transport between GPU and DRAM.
 - [RayRDT](https://docs.ray.io/en/master/ray-core/direct-transport.html) (alpha, [#PR167](https://github.com/TransferQueue/TransferQueue/pull/167)): Ray's new feature that allows Ray to store and pass objects directly between Ray actors.
 
 Among them, `SimpleStorageUnit` serves as our default storage backend, coordinated by the `AsyncSimpleStorageManager` class. Each storage unit can be deployed on a separate node, allowing for distributed data management.
@@ -121,6 +121,8 @@ To simplify the usage of TransferQueue, we have provided a Redis-style high-leve
 - **Metadata Tags**: Lightweight metadata for status tracking
 - **Pluggable Backends**: Supports multiple backends
 
+Refer to [tutorials/basic.ipynb](https://github.com/Ascend/TransferQueue/blob/main/tutorial/basic.ipynb) and [tutorials/02_kv_interface.py](https://github.com/Ascend/TransferQueue/blob/main/tutorial/02_kv_interface.py) for detailed usage examples.
+
 #### StreamingDataLoader API
 
 Designed as a drop-in replacement for the standard PyTorch `DataLoader`, this API allows each rank to automatically consume data without single-controller intervention.
@@ -147,17 +149,12 @@ Developers can leverage `TransferQueueClient` directly to implement advanced fea
 #### verl
 The primary motivation for integrating TransferQueue to verl now is to **alleviate the data transfer bottleneck of the single controller `RayPPOTrainer`**. Currently, all `DataProto` objects must be routed through `RayPPOTrainer`, resulting in a single point bottleneck of the whole post-training system.
 
-![verl_dataflow_DataProto](https://github.com/TransferQueue/community_doc/blob/main/docs/verl_workflow.jpeg?raw=true)
-
-Leveraging TransferQueue, we separate experience data transfer from metadata dispatch by
-
-- Replacing `DataProto` with `BatchMeta` (metadata) and `TensorDict` (actual data) structures
-- Preserving verl's original Dispatch/Collect logic via BatchMeta (maintaining single-controller debuggability)
-- Accelerating data transfer by TransferQueue's distributed storage units
+<p align="center">
+  <img src="https://raw.githubusercontent.com/wuxibin89/verl/refs/heads/wuxibin/doc_images/docs/_static/transfer_queue.png" width="100%">
-  <img src="https://raw.githubusercontent.com/wuxibin89/verl/refs/heads/wuxibin/doc_images/docs/_static/transfer_queue.png" width="100%">
+  <img src="https://raw.githubusercontent.com/wuxibin89/verl/wuxibin/doc_images/docs/_static/transfer_queue.png" width="100%">
-  <img src="https://raw.githubusercontent.com/wuxibin89/verl/refs/heads/wuxibin/doc_images/docs/_static/transfer_queue.png" width="100%">
+  <img src="https://raw.githubusercontent.com/wuxibin89/verl/wuxibin/doc_images/docs/_static/transfer_queue.png" width="100%">
+</p>
 
-![verl_dataflow_TransferQueue](https://github.com/TransferQueue/community_doc/blob/main/docs/verl_workflow_with_tq.jpeg?raw=true)
+Official integration to verl is available at [verl/pulls/5401](https://github.com/verl-project/verl/pull/5401), with design doc at [[RFC] PPOTrainer with TransferQueue Integration](https://github.com/verl-project/verl/issues/5400). You may also refer to our [recipe](https://github.com/Ascend/TransferQueue/blob/main/recipe/simple_use_case/single_controller_demo.py), where we mimic the verl usage in a high-level manner. 
 
-You may refer to the [recipe](https://github.com/Ascend/TransferQueue/tree/dev/recipe/simple_use_case), where we mimic the verl usage in both async & sync scenarios. Official integration to verl is also available now at [verl/pulls/3649](https://github.com/volcengine/verl/pull/3649) (with subsequent PRs to further optimize the integration).
 
 ### Disaggregated Example
 
@@ -216,11 +213,11 @@ pip install TransferQueue
   <img src="https://github.com/TransferQueue/community_doc/blob/main/docs/performance_0.1.1.dev2.png?raw=true" width="100%">
 </p>
 
-> Note: The above benchmark for TransferQueue is based on our naive `SimpleStorage` backend. By introducing high-performance storage backends and optimizing serialization/deserialization, we expect to achieve even better performance. Warmly welcome contributions from the community!
+> Note: Optimization for MooncakeStore and other backends are still in process. Warmly welcome contributions from the community!
 
-For detailed performance benchmarks, please refer to [this blog](https://www.yuque.com/haomingzi-lfse7/hlx5g0/tml8ke0zkgn6roey?singleDoc#).
+For detailed performance benchmarks, please refer to [this blog](https://www.yuque.com/haomingzi-lfse7/lhp4el/tml8ke0zkgn6roey?singleDoc#).
 
-We also provide a [stress test report](https://www.yuque.com/haomingzi-lfse7/hlx5g0/ydbwgo5k2umaag78?singleDoc#) that demonstrates **768 concurrent clients writing 1.4 TB of data** into TransferQueue across 4 nodes. The system remains stable without any crashes or data loss, achieving 80% bandwidth.
+We also provide a [stress test report](https://www.yuque.com/haomingzi-lfse7/lhp4el/mt0vedqy7c337pgg?singleDoc#) that demonstrates more than **8192 concurrent clients writing 2 TB of data** into TransferQueue across 4 nodes. The system remains stable without any crashes or data loss.
 
 <h2 id="customize"> 🛠️ Customize TransferQueue</h2>
 

diff --git a/pyproject.toml b/pyproject.toml
@@ -118,7 +118,7 @@ yuanrong = [
     "openyuanrong-datasystem"
 ]
 mooncake = [
-    "mooncake-transfer-engine"
+    "mooncake-transfer-engine==0.3.10.post1"
-    "mooncake-transfer-engine==0.3.10.post1"
+    "mooncake-transfer-engine>=0.3.10.post1,<0.4"
-    "mooncake-transfer-engine==0.3.10.post1"
+    "mooncake-transfer-engine>=0.3.10.post1,<0.4"
 ]
 
 # If you need to mimic `package_dir={'': '.'}`:

diff --git a/scripts/performance_test/README_PERFTEST.md b/scripts/performance_test/README_PERFTEST.md
@@ -42,19 +42,63 @@ python perftest.py \
 | `--head_node_ip` | Head node IP address | - | Yes |
 | `--worker_node_ip` | Worker node IP address (required for Yuanrong) | None | No |
 | `--output_csv` | Path to output CSV file | None | No |
+| `--use_complex_case` | Use complex test case with nested tensors and NonTensorStack fields | False | No |
 
 ## Backend Configuration
 
 The script reads the backend configuration directly from the provided `--backend_config` YAML file. The backend type is determined by `backend.storage_backend` in the config file. When `--backend` is specified, it overrides the value in the config.
 
-For device support of each backend:
-- `SimpleStorage`: `cpu`
-- `Yuanrong`: `cpu`, `npu`
-- `MooncakeStore`: `cpu`, `gpu`
+### SimpleStorage Configuration
 
-## Test Data Format
+```yaml
+backend:
+  storage_backend: SimpleStorage
+  SimpleStorage:
+    total_storage_size: 100000
+    num_data_storage_units: 16
+```
+
+### Yuanrong Configuration
+
+```yaml
+backend:
+  storage_backend: Yuanrong
+  Yuanrong:
+    port: 31501
+    enable_yr_npu_transport: true
+```
+
+For Yuanrong backend, writer runs on the head node and reader runs on the worker node. `--worker_node_ip` is required.
+
+### MooncakeStore Configuration
+
+```yaml
+backend:
+  storage_backend: MooncakeStore
+  MooncakeStore:
+    auto_init: true
+    metadata_server: localhost:50050
+    master_server_address: localhost:50051
+    local_hostname: ""
+    protocol: rdma
+    global_segment_size: 86294967296
+    local_buffer_size: 86294967296
+    device_name: ""
+```
+
+## Test Scenarios
+
+### Simple Test Case (Default)
+
+When `--use_complex_case` is **not** specified (default), the test creates a `TensorDict` with only regular tensors:
 
-The test case creates a `TensorDict` with three types of fields to simulate real training batches:
+- **Regular tensors**: Shape `(batch_size, seq_length)`, float32.
+
+Each regular tensor field size = `batch_size × seq_length × 4` bytes.
+
+### Complex Test Case
+
+When `--use_complex_case` is specified, the test creates a `TensorDict` with three types of fields to simulate real training batches:
 
 1. **Regular tensors**: Shape `(batch_size, seq_length)`, float32.
 2. **Nested tensors** (non-NPU devices): Variable-length ragged sequences with lengths forming an arithmetic progression from 1 to `seq_length`. Average length ≈ `seq_length / 2`, so each nested field is roughly half the size of a regular field.
@@ -73,10 +117,6 @@ Each iteration performs a PUT → LIST → GET → DELETE cycle via TransferQueu
 
 The test runs `--num_test_iterations` iterations. Data creation only happens in the first iteration; subsequent iterations reuse the same TensorDict to isolate transfer overhead.
 
-## Yuanrong Backend
-
-For Yuanrong backend, writer runs on the head node and reader runs on the worker node. `--worker_node_ip` is required.
-
 ## Running Full Test Suite
 
 The `run_perf_test.sh` script automates the full test suite across all backends and data sizes, then generates a comparison chart:
@@ -130,12 +170,18 @@ After running the tests, `draw_figure.py` reads all CSV files from `results/` an
 
 ## Examples
 
-### SimpleStorage backend
+### SimpleStorage backend (simple case)
 ```bash
 python perftest.py --backend_config=perftest_config.yaml --backend=SimpleStorage \
   --head_node_ip=192.168.0.1
 ```
 
+### SimpleStorage backend (complex case)
+```bash
+python perftest.py --backend_config=perftest_config.yaml --backend=SimpleStorage \
+  --head_node_ip=192.168.0.1 --use_complex_case
+```
+
 ### Yuanrong backend (inter-node)
 ```bash
 python perftest.py --backend_config=perftest_config.yaml --backend=Yuanrong \

diff --git a/transfer_queue/version/version b/transfer_queue/version/version
@@ -1 +1 @@
-0.1.6.dev0
+0.1.6
-0.1.6
+0.1.7.dev
-0.1.6
+0.1.7.dev