[Feature] Implement DMA support by BenkangPeng · Pull Request #293 · tancheng/VectorCGRA

BenkangPeng · 2026-06-02T13:55:27Z

This PR introduces CgraDmaRTL, which integrates the CGRA with a DMA engine and enables direct memory transfers between external DRAM (not implemented yet) and the CGRA's data SPM.

tancheng · 2026-06-02T21:31:46Z

+    s.mem_rd_req_val    = OutPort() # dma_read_request_valid
+    s.mem_rd_req_rdy    = InPort() # dma_read_request_ready
+    s.mem_rd_req_addr   = OutPort(DmaDramAddrType)


Why can't we use the RecvIfcRTL and SendIfcRTL interfaces to connect the DmaRTL?
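
For context on the question above, here is a minimal plain-Python sketch of the val/rdy handshake that a send/recv interface pair bundles into a single connection. It is not PyMTL3 code and does not use the repository's RecvIfcRTL or SendIfcRTL classes; `SendIfc` and `simulate` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SendIfc:
    """Hypothetical software stand-in for a val/rdy send interface."""
    msg: Optional[int] = None  # payload driven by the sender
    val: bool = False          # sender has a valid message this cycle
    rdy: bool = False          # receiver can accept a message this cycle

    def fires(self) -> bool:
        # A transfer happens only in a cycle where both sides agree.
        return self.val and self.rdy


def simulate(addrs: List[int], rdy_pattern: List[bool]) -> List[int]:
    """Offer each address in order, stalling in cycles where rdy is low."""
    ifc = SendIfc()
    pending = list(addrs)
    accepted = []
    for rdy in rdy_pattern:          # one loop iteration models one cycle
        ifc.val = bool(pending)
        ifc.msg = pending[0] if pending else None
        ifc.rdy = rdy
        if ifc.fires():
            accepted.append(pending.pop(0))
    return accepted


# The receiver stalls in the second cycle, so three transfers take four cycles.
print(simulate([0x10, 0x14, 0x18], [True, False, True, True]))  # [16, 20, 24]
```

With an interface like this, the three separate ports in the diff (valid, ready, address) would travel together as one connection.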

tancheng · 2026-06-03T18:58:19Z

+      s.data_mem.spm_dma_rval       //= s.spm_dma_rval
+      s.data_mem.spm_dma_rrdy       //= s.spm_dma_rrdy
+      s.data_mem.spm_dma_raddr      //= s.spm_dma_raddr
+      s.data_mem.spm_dma_rresp_val  //= s.spm_dma_rresp_val
+      s.data_mem.spm_dma_rresp_rdy  //= s.spm_dma_rresp_rdy
+      s.data_mem.spm_dma_rresp_data //= s.spm_dma_rresp_data


As we discussed before, should we connect the DMA to the controller (as an intermediate interface), instead of connecting it directly to the data SPM?

So then we can leverage

VectorCGRA/controller/ControllerRTL.py

Lines 138 to 139 in 44618d5

s.recv_from_cpu_pkt //= s.recv_from_cpu_pkt_queue.recv

s.send_to_cpu_pkt //= s.send_to_cpu_pkt_queue.send

to decide the next location (e.g., a local SPM bank or a remote SPM)?

HobbitQia · 2026-06-04T15:05:31Z

Hi @tancheng @BenkangPeng, I have summarized two directions for the DMA design below:

Option 1: Rely on data controller
- DMA is added as a new client of DataMemControllerRTL: the DMA engine exchanges data directly with DataMemControllerRTL, and the logic for multiplexing the SPM ports is also implemented in that module.
  
  To initiate DMA, the CPU sends dma_mvin or dma_mvout to the CGRA, after which the controller activates the DMA engine by sending start signals.
- Pros: Keeps the controller clean; provides a faster path because data does not go through the controller.
- Cons: Additional logic is required to feed DMA results into the control memory.

Option 2: All in controller
- All decoding logic is handled in the controller. The port-handling logic should still reside in DataMemControllerRTL, since the data memory should have its own port-multiplexing logic.
  
  The packetization logic should also be implemented in the controller module.
- Pros: Unifies control and data memory within the controller (the controller is already connected to both control and data memory).
- Cons: Introduces complex control logic in the controller; results in a slower path.

I prefer the second option, but I think some logic should still be written in DataMemControllerRTL. WDYT?

tancheng · 2026-06-05T08:30:27Z

Hi @HobbitQia, option 2 looks good to me, though I am not sure what additional logic would be needed in DataMemController.

tancheng · 2026-06-05T08:39:52Z

+    s.dma_cmd_val       //= s.dma.dma_cmd_val
+    s.dma_cmd_rdy       //= s.dma.dma_cmd_rdy
+    s.dma_cmd_opcode    //= s.dma.dma_cmd_opcode
+    s.dma_cmd_dram_addr //= s.dma.dma_cmd_dram_addr
+    s.dma_cmd_spm_addr  //= s.dma.dma_cmd_spm_addr
+    s.dma_cmd_bytes     //= s.dma.dma_cmd_bytes
+    s.dma_cmd_tag       //= s.dma.dma_cmd_tag
+
+    s.dma_done_val      //= s.dma.dma_done_val
+    s.dma_done_rdy      //= s.dma.dma_done_rdy
+    s.dma_done_tag      //= s.dma.dma_done_tag
+
+    s.dram_rd_req       //= s.dma.dram_rd_req
+    s.dram_rd_resp      //= s.dma.dram_rd_resp
+
+    s.dram_wr_req_val    //= s.dma.dram_wr_req_val
+    s.dram_wr_req_rdy    //= s.dma.dram_wr_req_rdy
+    s.dram_wr_req_addr   //= s.dma.dram_wr_req_addr
+    s.dram_wr_req_data   //= s.dma.dram_wr_req_data
+    s.dram_wr_req_mask   //= s.dma.dram_wr_req_mask
+
+    s.dram_wr_resp_val   //= s.dma.dram_wr_resp_val
+    s.dram_wr_resp_rdy   //= s.dma.dram_wr_resp_rdy
+
+    # DMA to SPM connections.
+
+    s.dma.spm_dma_wval       //= s.cgra.spm_dma_wval
+    s.dma.spm_dma_wrdy       //= s.cgra.spm_dma_wrdy
+    s.dma.spm_dma_waddr      //= s.cgra.spm_dma_waddr
+    s.dma.spm_dma_wdata      //= s.cgra.spm_dma_wdata
+    s.dma.spm_dma_wmask      //= s.cgra.spm_dma_wmask
+
+    s.dma.spm_dma_rval       //= s.cgra.spm_dma_rval
+    s.dma.spm_dma_rrdy       //= s.cgra.spm_dma_rrdy
+    s.dma.spm_dma_raddr      //= s.cgra.spm_dma_raddr
+    s.dma.spm_dma_rresp_val  //= s.cgra.spm_dma_rresp_val
+    s.dma.spm_dma_rresp_rdy  //= s.cgra.spm_dma_rresp_rdy
+    s.dma.spm_dma_rresp_data //= s.cgra.spm_dma_rresp_data


All of these will change if we go with @HobbitQia's option 2, right?

Moreover, can we use send/recv interfaces and define a msg (https://github.com/tancheng/VectorCGRA/blob/master/lib/messages.py) that encapsulates data, addr, or whatever else is needed as a struct? Then we don't need to declare so many ports and explicitly connect each of them. The CGRA RTL shouldn't see these details; the struct can be decomposed inside the submodule.

> All of these will change if we go with @HobbitQia's option 2, right?

Yes.

> Moreover, can we use send/recv interfaces and define a msg (https://github.com/tancheng/VectorCGRA/blob/master/lib/messages.py) that encapsulates data, addr, or whatever else is needed as a struct? Then we don't need to declare so many ports and explicitly connect each of them. The CGRA RTL shouldn't see these details; the struct can be decomposed inside the submodule.

I agree that we can use these helper functions or classes to define the ports. I initially designed the interface only from the perspective of connecting the DMA engine to Chipyard, which is why I defined these ports individually. But I believe these input/output ports can be packed into our structs and wrapped by an adapter so that we can connect them. @BenkangPeng
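
As a rough illustration of the struct idea discussed in this thread, the sketch below bundles the DMA command fields into one message that can be packed into a single word and decomposed again inside a submodule. It is plain Python, not the repository's lib/messages.py API. The field names follow the `dma_cmd_*` ports in the diff (with `nbytes` standing in for `dma_cmd_bytes`), while `DmaCmdMsg`, `FIELDS`, and every bit width are assumptions made for the example.

```python
from dataclasses import dataclass

# (field name, width in bits). All widths are illustrative assumptions.
FIELDS = (
    ("opcode", 2),
    ("dram_addr", 32),
    ("spm_addr", 16),
    ("nbytes", 16),
    ("tag", 4),
)


@dataclass(frozen=True)
class DmaCmdMsg:
    """Hypothetical DMA command message carried over one send/recv link."""
    opcode: int
    dram_addr: int
    spm_addr: int
    nbytes: int
    tag: int

    def pack(self) -> int:
        """Concatenate the fields into one word, first field most significant."""
        word = 0
        for name, width in FIELDS:
            value = getattr(self, name)
            if not 0 <= value < (1 << width):
                raise ValueError(f"{name}={value} does not fit in {width} bits")
            word = (word << width) | value
        return word

    @classmethod
    def unpack(cls, word: int) -> "DmaCmdMsg":
        """Split a packed word back into fields, least significant first."""
        values = {}
        for name, width in reversed(FIELDS):
            values[name] = word & ((1 << width) - 1)
            word >>= width
        return cls(**values)


cmd = DmaCmdMsg(opcode=1, dram_addr=0x8000_0000, spm_addr=0x40, nbytes=256, tag=3)
assert DmaCmdMsg.unpack(cmd.pack()) == cmd  # round trip preserves every field
```

The top-level module would then connect one message-typed interface instead of five separate command ports, and only the DMA submodule would need to know the field layout.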

HobbitQia · 2026-06-07T14:56:54Z

> Hi @HobbitQia, option 2 looks good to me, though I am not sure what additional logic would be needed in DataMemController.

My thinking is that if we want DMA to run concurrently with ordinary loads/stores, we need to multiplex the data SPM ports, and that logic can be implemented in DataMemController. Alternatively, we can convert the data from DRAM into packets and send them to the SPM with commands such as CMD_STORE_REQUEST or CMD_LOAD_REQUEST. Then all of our logic can live in the controller, but with higher latency. I think the former approach may perform better with only a minimal addition to DataMemController.
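
The port multiplexing mentioned above could be modeled, very roughly, as a two-client arbiter in front of one SPM port. This is a behavioral sketch in plain Python, not the PR's DataMemControllerRTL logic; the class name, the client labels `lsu` and `dma`, and the round-robin policy are all assumptions chosen for illustration.

```python
from typing import Optional


class SpmPortArbiter:
    """Hypothetical round-robin arbiter sharing one SPM port between two clients."""

    CLIENTS = ("lsu", "dma")  # ordinary load/store path and DMA path

    def __init__(self) -> None:
        # Index of the most recently granted client. Starting at 1 means
        # the load/store path wins the first tie.
        self.last = 1

    def grant(self, lsu_val: bool, dma_val: bool) -> Optional[str]:
        """Return the client granted this cycle, or None if nobody requests."""
        requests = (lsu_val, dma_val)
        # Check the client after the previous winner first, then the winner itself.
        for offset in (1, 2):
            idx = (self.last + offset) % 2
            if requests[idx]:
                self.last = idx
                return self.CLIENTS[idx]
        return None


arb = SpmPortArbiter()
# When both clients request every cycle, grants alternate between them.
print([arb.grant(True, True) for _ in range(4)])  # ['lsu', 'dma', 'lsu', 'dma']
```

A fixed-priority policy would be simpler but could starve one client under sustained traffic; which policy suits the design is left open in the discussion.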

tancheng · 2026-06-07T18:15:57Z

> > Hi @HobbitQia, option 2 looks good to me, though I am not sure what additional logic would be needed in DataMemController.
>
> My thinking is that if we want DMA to run concurrently with ordinary loads/stores, we need to multiplex the data SPM ports, and that logic can be implemented in DataMemController. Alternatively, we can convert the data from DRAM into packets and send them to the SPM with commands such as CMD_STORE_REQUEST or CMD_LOAD_REQUEST. Then all of our logic can live in the controller, but with higher latency. I think the former approach may perform better with only a minimal addition to DataMemController.

I thought the latency would be the same if the controller splits CMD_STORE_REQUEST into CMD_STORE_REQUEST_FROM_NOC and CMD_STORE_REQUEST_FROM_CPU (and we add another inport on the xbar)?

Adding logic inside DataMemController effectively bypasses the CGRA controller, which doesn't align with your option 2. WDYT?
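
The command split proposed above might be sketched as follows. The two command names come from the comment; the enum, the inport numbering, and the helper functions are hypothetical and do not reflect the repository's actual command encoding or crossbar layout.

```python
from enum import Enum, auto


class Cmd(Enum):
    """Store commands distinguished by where the request originated."""
    CMD_STORE_REQUEST_FROM_NOC = auto()
    CMD_STORE_REQUEST_FROM_CPU = auto()


# Hypothetical crossbar inport numbering: port 1 is the extra inport that
# the comment suggests adding for CPU-originated (DMA) traffic.
XBAR_INPORT = {
    Cmd.CMD_STORE_REQUEST_FROM_NOC: 0,
    Cmd.CMD_STORE_REQUEST_FROM_CPU: 1,
}


def tag_store(from_cpu: bool) -> Cmd:
    """Pick the store command variant based on the request's origin."""
    if from_cpu:
        return Cmd.CMD_STORE_REQUEST_FROM_CPU
    return Cmd.CMD_STORE_REQUEST_FROM_NOC


def route(cmd: Cmd) -> int:
    """Return the crossbar inport a command should enter on."""
    return XBAR_INPORT[cmd]


assert route(tag_store(from_cpu=True)) == 1
assert route(tag_store(from_cpu=False)) == 0
```

Giving each source its own inport lets the two kinds of store traffic enter the crossbar separately, which is the basis for the equal-latency argument in the comment.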

HobbitQia · 2026-06-08T02:15:57Z

> > > Hi @HobbitQia, option 2 looks good to me, though I am not sure what additional logic would be needed in DataMemController.
> >
> > My thinking is that if we want DMA to run concurrently with ordinary loads/stores, we need to multiplex the data SPM ports, and that logic can be implemented in DataMemController. Alternatively, we can convert the data from DRAM into packets and send them to the SPM with commands such as CMD_STORE_REQUEST or CMD_LOAD_REQUEST. Then all of our logic can live in the controller, but with higher latency. I think the former approach may perform better with only a minimal addition to DataMemController.
>
> I thought the latency would be the same if the controller splits CMD_STORE_REQUEST into CMD_STORE_REQUEST_FROM_NOC and CMD_STORE_REQUEST_FROM_CPU (and we add another inport on the xbar)?
>
> Adding logic inside DataMemController effectively bypasses the CGRA controller, which doesn't align with your option 2. WDYT?

If the DMA data goes through the controller's packet path, there may be extra latency from packetization, and NoC/CPU/tile requests may contend for the SPM. Or would we have two separate paths in the controller?

BenkangPeng requested review from HobbitQia and tancheng June 2, 2026 13:55

tancheng reviewed Jun 2, 2026

BenkangPeng force-pushed the dma-cgra branch from f41e7a6 to 86f25a4 Compare June 3, 2026 10:29

BenkangPeng commented Jun 3, 2026

Comment thread mem/data/DataMemControllerRTL.py Outdated

tancheng reviewed Jun 3, 2026

BenkangPeng mentioned this pull request Jun 4, 2026

[CleanUp][NFC] Standardize line endings to LF #294

Merged

BenkangPeng added 13 commits June 4, 2026 17:53

Add the DmaEngine implementation and the test.

308a213

[Test] Update the test of DmaEngine.

90d9eef

Add DMA support to DataMemControllerRTL and implement corresponding tests.

5e615e1

Add the dma ports into CgraTemplateRTL

30bdc36

Wrap the Cgra and Dma into one single module.

e6c0b3b

[Script] Add the local_CI script file

c3f3dc4

Update .gitignore to ignore the log file

046c860

[Test] Add the test for CgraDmaRTL

4359f1f

[Fix] Fix the bit mismatch error between dma_idx and num_xbar_in_ports.

aff3a8a

[Doc] Add some comments

b2e41e8

[Fix] Fix the bit mismatch by type convertion

5fc388c

Move some constant into common header file

70ae3da

[Refactor] Wrap the signals between dma and dram with SendIfcRTL and RecvIfcRTL. Replace `mem` with `dram` for clarity.

fc589c5

BenkangPeng force-pushed the dma-cgra branch from 86f25a4 to fc589c5 Compare June 4, 2026 10:15

tancheng reviewed Jun 5, 2026
