[Feature] Implement DMA support by BenkangPeng · Pull Request #293 · tancheng/VectorCGRA

BenkangPeng · 2026-06-02T13:55:27Z

This PR introduces CgraDmaRTL, which integrates the CGRA with a DMA engine, enabling direct memory transfers between external DRAM (not implemented yet) and the CGRA's data SPM.

HobbitQia · 2026-06-04T15:05:31Z

Hi @tancheng @BenkangPeng, I have summarized two directions for the DMA design below:

Option 1: Rely on data controller
- DMA is added as a new client of the DataMemControllerRTL: the DMA engine exchanges data directly with the DataMemControllerRTL, and the logic for multiplexing SPM ports is also implemented in that module.
  
  To initiate DMA, the CPU can send dma_mvin or dma_mvout to the CGRA, after which the controller activates the DMA engine by sending start signals.
- Pros: Keeps the controller clean; provides a faster path because data does not go through the controller.
- Cons: Additional logic is required to feed DMA results into the control memory.

Option 2: All in controller
- All decoding logic is handled in the controller. The logic for handling the logic port should still reside in the DataMemControllerRTL, since the data memory should have its own port multiplexing logic.
  
  The packetization logic should also be implemented in the controller module.
- Pros: Unifies control and data memory within the controller (the controller is already connected to both control and data memory).
- Cons: Introduces complex control logic in the controller; results in a slower path.

I prefer the second method, but I think some logic should still be written in DataMemControllerRTL. WDYT?

tancheng · 2026-06-05T08:30:27Z

Hi @HobbitQia, option 2 looks good to me. Though I am not sure what additional logic should go in DataMemController?

HobbitQia · 2026-06-07T14:56:54Z

I am thinking that if we want to enable DMA and traditional load/store to run concurrently, we need to multiplex the ports of the Data SPM, and I think this logic can be implemented in DataMemController. However, we could also transform the data from DRAM into packets and use commands like CMD_STORE_REQUEST or CMD_LOAD_REQUEST to send them to the SPM. Then we could write our logic entirely in the controller, but with higher latency. I think the former method may give better performance with minimal additions to DataMemController.

tancheng · 2026-06-07T18:15:57Z

I thought they would have the same latency if we split CMD_STORE_REQUEST into CMD_STORE_REQUEST_FROM_NOC and CMD_STORE_REQUEST_FROM_CPU (and add another inport on the xbar) in the controller?

Adding logic inside the DataMemController kind of bypasses the CGRA controller, which doesn't align with your Option 2. WDYT?

HobbitQia · 2026-06-08T02:15:57Z

If the DMA data goes through the controller's packet path, there may be extra latency from packetization, and there may be contention between NoC/CPU/tile requests to the SPM? Or would we have two separate paths in the Controller?

tancheng · 2026-06-08T04:21:57Z

> there may be extra latency of packeting

Packing can just be combinational logic before putting the packet into a queue.

> and there may be competitions between NoC/CPU/tile request to SPM? Or we have two separate paths in Controller?

I was thinking about separate paths, which is why I mentioned FROM_CPU and FROM_NOC. I feel that requests targeting the same SPM bank would conflict with each other anyway. WDYT?
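The point about packing being purely combinational can be illustrated with a small behavioural model in plain Python (not PyMTL3). Every name here (`StoreRequestPkt`, `pack_dma_store`, `PktQueue`, the field layout, and the numeric value given to `CMD_STORE_REQUEST`) is hypothetical and not taken from the VectorCGRA code base. The sketch only shows that forming the packet is a stateless function of its inputs, so the queue is the only sequential element on the path.

```python
from collections import deque
from dataclasses import dataclass

# Hypothetical encoding; the real CMD_STORE_REQUEST constant lives in the repo.
CMD_STORE_REQUEST = 1


@dataclass(frozen=True)
class StoreRequestPkt:
    cmd: int
    addr: int
    data: int


def pack_dma_store(addr: int, data: int) -> StoreRequestPkt:
    """Stateless packing: the output depends only on the current inputs."""
    return StoreRequestPkt(cmd=CMD_STORE_REQUEST, addr=addr, data=data)


class PktQueue:
    """Bounded FIFO standing in for the queue in front of the crossbar."""

    def __init__(self, depth: int = 2):
        self.depth = depth
        self.entries = deque()

    def rdy(self) -> bool:
        # Ready to accept a packet only while there is free space.
        return len(self.entries) < self.depth

    def enq(self, pkt: StoreRequestPkt) -> bool:
        # Returns False when full; the producer has to retry later.
        if not self.rdy():
            return False
        self.entries.append(pkt)
        return True


queue = PktQueue(depth=2)
accepted = [queue.enq(pack_dma_store(addr, addr * 10)) for addr in range(3)]
print(accepted)
```

In this model the third enqueue is rejected because the queue is full, which is where back-pressure, rather than packing, would add latency.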

HobbitQia · 2026-06-08T04:41:47Z

Got it. I mean the conflict between different paths. Even if we have FROM_CPU and FROM_NOC, there may be concurrent reads/writes on the SPMs. So maybe we need to multiplex the in/out ports of the SPMs, and the logic for multiplexing and handling the conflicts can be implemented in DataSPMController?
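As a rough illustration of the conflict handling discussed here, the following plain-Python sketch models a per-bank round-robin arbiter. It is not the DataMemControllerRTL implementation: the source names, the `BankArbiter` class, and the low-order address interleaving in `bank_of` are all assumptions made for the example. It shows that requests from different paths only conflict when they target the same bank.

```python
from typing import Dict, Optional

SOURCES = ("cpu", "noc", "dma")  # hypothetical request paths


def bank_of(addr: int, num_banks: int) -> int:
    """Assumed low-order interleaving of addresses across banks."""
    return addr % num_banks


class BankArbiter:
    """Round-robin arbiter granting at most one request per bank per cycle."""

    def __init__(self, num_banks: int):
        self.num_banks = num_banks
        # Per-bank rotating priority pointer (index into SOURCES).
        self.next_prio = [0] * num_banks

    def arbitrate(self, requests: Dict[str, Optional[int]]) -> Dict[str, bool]:
        """`requests` maps a source name to its address, or None when idle."""
        grants = {src: False for src in SOURCES}
        for bank in range(self.num_banks):
            contenders = [
                src for src in SOURCES
                if requests.get(src) is not None
                and bank_of(requests[src], self.num_banks) == bank
            ]
            if not contenders:
                continue
            # Grant the first contender at or after the priority pointer.
            for offset in range(len(SOURCES)):
                idx = (self.next_prio[bank] + offset) % len(SOURCES)
                if SOURCES[idx] in contenders:
                    grants[SOURCES[idx]] = True
                    self.next_prio[bank] = (idx + 1) % len(SOURCES)
                    break
        return grants


arb = BankArbiter(num_banks=2)
# cpu and dma both target bank 0; noc targets bank 1.
first = arb.arbitrate({"cpu": 0, "noc": 1, "dma": 2})
second = arb.arbitrate({"cpu": 0, "noc": 1, "dma": 2})
print(first, second)
```

With two banks, the NoC request proceeds every cycle while the CPU and DMA requests alternate on the shared bank.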

tancheng · 2026-06-08T05:21:11Z

Oh, we don't need to distinguish FROM_CPU and FROM_NOC, we can decompose the requests from CPU like:

VectorCGRA/controller/ControllerRTL.py

Line 207 in eb71842

    
           s.recv_from_cpu_pkt_queue.send.rdy @= s.crossbar.recv[kFromCpuCtrlAndDataIdx].rdy

HobbitQia · 2026-06-08T06:30:39Z

But if we add DMA, there should be another path FROM_DMA, which is different from FROM_CPU/FROM_NOC?

tancheng · 2026-06-08T17:12:23Z

Ah, yes, makes sense. That FROM_DMA would be similar to FROM_CPU. The DMA_Controller then can assemble the different ports into our struct and send/recv interface.

HobbitQia · 2026-06-09T01:08:28Z

DMA_Controller

So this DMA_Controller refers to DataSPMController or our DMA engine?

tancheng · 2026-06-09T01:28:47Z

DMA engine in your figure, or @BenkangPeng's DmaEngineRTL.

tancheng · 2026-06-10T06:49:37Z

@BenkangPeng will you update this PR accordingly?

BenkangPeng · 2026-06-10T11:05:01Z

Yes, sorry for the delay. I will update this PR as soon as possible.

tancheng · 2026-06-15T05:26:49Z

+
+  return mk_bitstruct(new_name, {
+    'dram_data': DramDataType,
+    'dram_mask': DramMaskType,


Could you explain what dram_mask is with a comment?

tancheng · 2026-06-15T05:27:00Z

+    'dram_data': DramDataType,
+    'dram_mask': DramMaskType,
+    'spm_data': SpmDataType,
+    'spm_mask': SpmMaskType,


Could you explain what spm_mask is with a comment?
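This thread does not define the mask semantics (commit b772b7b later documents them in messages.py and DmaEngineRTL.py). As a hedged illustration only, one common reading of such a mask is a per-byte write enable. The helper `apply_byte_mask` below is hypothetical and assumes that bit i of the mask selects byte i of the word; the actual dram_mask and spm_mask layouts may differ.

```python
def apply_byte_mask(old_word: int, new_word: int, mask: int, nbytes: int = 4) -> int:
    """Merge new_word into old_word byte by byte.

    Assumption: bit i of `mask` set means byte i is taken from new_word;
    otherwise the byte already stored in old_word is preserved.
    """
    result = 0
    for i in range(nbytes):
        byte_select = 0xFF << (8 * i)
        source = new_word if (mask >> i) & 1 else old_word
        result |= source & byte_select
    return result


# Only bytes 0 and 2 are overwritten.
merged = apply_byte_mask(old_word=0x11223344, new_word=0xAABBCCDD, mask=0b0101)
print(hex(merged))
```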

tancheng · 2026-06-15T05:35:15Z

We should also include the figure 2 into our

tancheng · 2026-06-16T15:46:19Z

@BenkangPeng, since you introduce IntegratedDmaWithCgraRTL, we should update this diagram so that the "Controller", "DataMemController", "SPM", and "Control SPM" are wrapped in a CGRA box, and the entire diagram (except the "CPU") is wrapped in an "IntegratedDmaWithCgra" box.

tancheng · 2026-06-27T18:41:23Z

+    # Send the response of reading from SPM to the DMA.
+    s.send_to_dma_spm_rd_resp   = SendIfcRTL(DmaSpmReadRespType)
+
+    # Data memory side of the same SPM access path.


Data memory side of the same SPM access path.

What do you mean by "same SPM"? And should we change "Data memory side" to "SRAM data memory"? It is a bit confusing now.

tancheng · 2026-06-27T18:41:55Z

+
+    # Data memory side of the same SPM access path.
+    # Send the request of writing into SPM to the data_mem controller.
+    s.send_to_mem_spm_wr_req   = SendIfcRTL(DmaSpmWriteReqType)


How is this different from above "send_to_mem_store_request"?

IIRC, this is the isolated and newly introduced ifc to distinguish from legacy inter-cgra store_req. We need to refactor their names, e.g.,
s.send_to_mem_store_request -> s.send_to_sram_store_request_from_noc

s.send_to_mem_spm_wr_req -> s.send_to_sram_store_request_from_dma

How does this sound? The load and store_response interfaces could be renamed similarly. Also, please leave a comment noting that we prefer to keep the two separate because the DMA can perform burst data movement.

BenkangPeng requested review from HobbitQia and tancheng June 2, 2026 13:55

BenkangPeng force-pushed the dma-cgra branch from f41e7a6 to 86f25a4 Compare June 3, 2026 10:29

BenkangPeng mentioned this pull request Jun 4, 2026

[CleanUp][NFC] Standardize line endings to LF #294

Merged

BenkangPeng force-pushed the dma-cgra branch from 86f25a4 to fc589c5 Compare June 4, 2026 10:15

BenkangPeng added 28 commits June 27, 2026 21:57

- 43da86d: [Feature] Introduce DMA data structure and DMA-to-DRAM write request interface for enhanced data transfer capabilities.
- 6e647dd: [Refactor] Pass DmaCmdType and DmaDataType into DataMemController and then drives types from them
- 78a1587: [Refactor] Update DmaEngineRTL to use DmaDramWrReqIfcRTL for DRAM write requests and adjust related signal handling for clarity and consistency.
- 6fb7e50: [Refactor] Enhance DMA integration in CgraTemplateRTL and ControllerRTL by passing DmaDataType and DmaCmdType as parameters, and updating related type definitions for improved clarity and consistency.
- a7618d8: [Refactor] Update CgraDmaRTL to utilize DmaDramWrReqIfcRTL for DRAM write requests, enhancing type definitions for DmaCmdType and DmaDataType
- 1bf3b79: [Fix] Fix the bitwidth mismatch error between DataType and DmaSpmDataType
- d4ce981: [CleanUp] Update DMA attribute references to use new constants for improved clarity
- bca3100: [Rename][NFC] Rename some variables for clarity
- 075f63f: Add the assertion to ensure the number of tranfer data is the multiple of 4
- 628e2d3: Add assertions to ensure that the number of bytes transferred by DMA is an integer multiple of 4
- 90023f2: [Refactor] Remove DmaWireIfcRTL and DmaSpmWireIfcRTL. Use ValRdyRecv/SendRTL to replace them.
- 37a363e: Split the dma_spm_to_dram into 3 signals.
- af3c0a6: Deprecate the DmaSpmMasterRTL in DMA module
- 0fb1b5a: Refactor DataMemControllerRTL to replace DmaSpmMinionIfcRTL with ValRdyRecv/SendIfcRTL for improved clarity and consistency in DMA signal handling.
- 0bb2d9c: Refactor CgraDmaRTL and CgraTemplateRTL to replace DmaSpmMinionIfcRTL with ValRdyRecv/SendIfcRTL
- 06eeec4: Add CgraDmaRTL wrapper integrating CGRA with DMA engine and corresponding tests
- 33622ea: Refactor CgraDmaRTL to replace DmaDramWrReqIfcRTL with new DMA DRAM write request type and update corresponding tests for consistency
- 832f701: Refactor DMA signal handling across multiple components to improve clarity and consistency by renaming signals related to memory requests and responses.
- 94ca68a: [Fix] Precompute commonly used values in DmaEngineRTL to avoid PyMTL3 AST translation limitations and enforce nbytes % 4 check in construct method instead of update block.
- 282159d: Add Verilog generation functionality for the new wrapper.
- b772b7b: Enhance DMA documentation in messages.py and DmaEngineRTL.py by adding detailed comments on mask design and data transfer granularity, clarifying the behavior of dram_mask and spm_mask during DMA operations.
- 089e4ba: [Rename] Rename tag to dma_tag
- 0ed4b3c: [Rename] Update references from 'ctrl' to 'controller'. Enhance documentation regarding the usage of dma_tag in related files.
- 618d6e1: [Fix] Update dma_cmd string representation to use 'dma_tag' instead of 'tag'.
- 04a2a4f: Add warning comment in ControllerRTL.py regarding potential conflict between DMA command and CMD_COMPLETE signals in the same clock cycle, with a reference to related discussion.
- 304ae24: [Fix] Update DmaEngineRTL to use dma_tag
- 85fcd19: [Fix] Update ControllerRTL and DmaEngineRTL to consistently use 'dma_tag' instead of 'tag' in DMA-related messages and tests.
- 559c419: Refactor DmaEngineRTL to simplify word calculation logic
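Several of the commits above enforce that a DMA transfer size is a multiple of 4 bytes, and commit 94ca68a moves that check out of the update block into the construct method. The idea can be sketched in plain Python as follows. The class `DmaTransferConfig` and its fields are hypothetical and only assume that the transfer size is known when the component is built. Validating and precomputing the word count once, at construction time, leaves the per-cycle logic with constants only.

```python
class DmaTransferConfig:
    """Hypothetical holder for values precomputed at construction time."""

    WORD_BYTES = 4  # transfer granularity named in the commit messages

    def __init__(self, nbytes: int):
        # Reject invalid sizes up front instead of checking every cycle.
        if nbytes <= 0 or nbytes % self.WORD_BYTES != 0:
            raise ValueError(
                f"nbytes must be a positive multiple of {self.WORD_BYTES}, got {nbytes}"
            )
        self.nbytes = nbytes
        # Precomputed once so per-cycle logic never needs division or modulo.
        self.num_words = nbytes // self.WORD_BYTES


cfg = DmaTransferConfig(nbytes=16)
print(cfg.num_words)
```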

BenkangPeng force-pushed the dma-cgra branch from 5331cbd to 559c419 Compare June 27, 2026 13:57
