[Feature] Implement DMA support #293

Open
BenkangPeng wants to merge 45 commits into tancheng:master from BenkangPeng:dma-cgra
Conversation

@BenkangPeng

Collaborator

Related issue: coredac/CGRA-SoC#2

This PR introduces CgraDmaRTL, which integrates the CGRA with a DMA engine, enabling direct memory transfers between external DRAM (not implemented yet) and the CGRA's data SPM.

@BenkangPeng BenkangPeng requested review from HobbitQia and tancheng June 2, 2026 13:55
Comment thread mem/dma/DmaEngineRTL.py Outdated
Comment thread mem/dma/DmaEngineRTL.py
Comment thread mem/dma/DmaEngineRTL.py
Comment thread mem/dma/DmaEngineRTL.py Outdated
Comment thread mem/dma/DmaEngineRTL.py Outdated
Comment thread mem/dma/DmaEngineRTL.py Outdated
Comment thread cgra/CgraDmaRTL.py Outdated
Comment thread mem/data/DataMemControllerRTL.py Outdated
Comment thread cgra/CgraTemplateRTL.py Outdated
@HobbitQia

Collaborator

Hi @tancheng @BenkangPeng, I summarized two directions for the DMA design below:

  • Rely on data controller

    • DMA is added as a new client of the DataMemControllerRTL, where data in the DMA engine communicates directly with the DataMemControllerRTL, and the logic for multiplexing SPM ports is also implemented in that module.

      To initiate DMA, the CPU can send dma_mvin or dma_mvout to the CGRA, after which the controller activates the DMA engine by sending start signals.

    • Pros: Keeps the controller clean; provides a faster path because data does not go through the controller.

    • Cons: Additional logic is required to feed DMA results into the control memory.

    image
  • All in controller

    • All decoding logic is handled in the controller. The logic for handling the logic port should still reside in the DataMemControllerRTL, since the data memory should have its own port multiplexing logic.

      The logic of packeting should also be implemented in the controller module.

    • Pros: Unifies control and data memory within the controller (the controller is already connected to both control and data memory).

    • Cons: Introduces complex control logic in the controller; results in a slower path.

    image

I prefer the second method, but I think some logic should still be written in DataMemControllerRTL. WDYT?

@tancheng

tancheng commented Jun 5, 2026

Owner

Hi @HobbitQia, option 2 looks good to me. Though I am not sure what additional logic should go in DataMemController?

Comment thread cgra/CgraDmaRTL.py Outdated
@HobbitQia

Collaborator

I am thinking that if we want DMA and traditional load/store to run concurrently, we need to multiplex the Data SPM ports, and I think this logic can be implemented in DataMemController. Alternatively, we can transform the data from DRAM into packets and use commands like CMD_STORE_REQUEST or CMD_LOAD_REQUEST to send them to the SPM. Then we can write our logic entirely in the controller, but with higher latency. I think the former method may give better performance with minimal additions to DataMemController.
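
For illustration only (not code from this PR): a minimal plain-Python behavioral sketch of multiplexing one Data SPM port between traditional load/store and DMA. The class `SpmPortArbiter`, the requester names `ldst` and `dma`, and the round-robin policy are all assumptions; the thread does not specify an arbitration policy.

```python
from collections import deque


class SpmPortArbiter:
    """Hypothetical round-robin arbiter for a single Data SPM port.

    Behavioral model only; names and policy are illustrative assumptions,
    not the DataMemControllerRTL implementation.
    """

    def __init__(self, requesters=("ldst", "dma")):
        self.requesters = list(requesters)
        self.queues = {name: deque() for name in self.requesters}
        self.next_idx = 0  # requester that gets first pick this cycle

    def push(self, requester, request):
        """Enqueue a pending request from the given requester."""
        self.queues[requester].append(request)

    def tick(self):
        """Grant at most one request per cycle.

        Returns (requester, request), or None if nothing is pending.
        """
        n = len(self.requesters)
        for offset in range(n):
            idx = (self.next_idx + offset) % n
            name = self.requesters[idx]
            if self.queues[name]:
                # The winner gets lowest priority on the next cycle.
                self.next_idx = (idx + 1) % n
                return name, self.queues[name].popleft()
        return None


arb = SpmPortArbiter()
arb.push("ldst", "ld0")
arb.push("ldst", "ld1")
arb.push("dma", "wr0")
grants = [arb.tick() for _ in range(4)]
```

With both requesters pending, grants alternate, so neither DMA bursts nor tile load/store traffic can starve the other in this sketch.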

@tancheng

tancheng commented Jun 7, 2026

Owner

I thought they would have the same latency if we split CMD_STORE_REQUEST into CMD_STORE_REQUEST_FROM_NOC and CMD_STORE_REQUEST_FROM_CPU (and add another inport on the xbar) in the controller?

Adding logic inside the DataMemController kind of bypasses the CGRA controller, which doesn't align with your Option 2, WDYT?
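
For illustration only: a small plain-Python sketch of the idea of tagging store requests by source and giving each source its own crossbar inport. The two command names come from the comment above; the inport indices, the `XBAR_INPORT` table, and `route_to_inports` are hypothetical.

```python
from enum import Enum, auto


class Cmd(Enum):
    CMD_STORE_REQUEST_FROM_NOC = auto()
    CMD_STORE_REQUEST_FROM_CPU = auto()


# Hypothetical mapping from command to crossbar inport index.
XBAR_INPORT = {
    Cmd.CMD_STORE_REQUEST_FROM_NOC: 0,
    Cmd.CMD_STORE_REQUEST_FROM_CPU: 1,
}


def route_to_inports(requests, num_inports=2):
    """Place each (cmd, addr, data) request on the inport for its source."""
    inports = [[] for _ in range(num_inports)]
    for cmd, addr, data in requests:
        inports[XBAR_INPORT[cmd]].append((addr, data))
    return inports


inports = route_to_inports([
    (Cmd.CMD_STORE_REQUEST_FROM_CPU, 0x10, 7),
    (Cmd.CMD_STORE_REQUEST_FROM_NOC, 0x20, 9),
    (Cmd.CMD_STORE_REQUEST_FROM_CPU, 0x14, 8),
])
```

Because each source has its own inport, requests from different sources do not queue behind each other before reaching the crossbar in this model.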

@HobbitQia

Collaborator

If the DMA data should go through the controller packet path, there may be extra latency of packeting, and there may be competitions between NoC/CPU/tile request to SPM? Or we have two separate paths in Controller?

@tancheng

tancheng commented Jun 8, 2026

Owner

there may be extra latency of packeting

Packing can just be combinational logic before putting into a queue.

and there may be competitions between NoC/CPU/tile request to SPM? Or we have two separate paths in Controller?

I was thinking about separate paths, so mentioned FROM_CPU and FROM_NOC. I feel the requests targeting same SPM bank should anyway conflict with each other, WDYT?
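
For illustration only: a plain-Python sketch of the two points above. `pack` is a pure function, modeling packing as combinational logic that adds no cycle before the queue, and `find_bank_conflicts` shows that requests to the same SPM bank conflict regardless of which path they arrived on. The `Packet` fields and the address-to-bank mapping (`num_banks`, `words_per_bank`) are assumptions.

```python
from collections import namedtuple

Packet = namedtuple("Packet", ["src", "addr", "data"])


def pack(src, addr, data):
    """Stateless packing: models combinational logic ahead of a queue."""
    return Packet(src, addr, data)


def bank_of(addr, num_banks, words_per_bank):
    """Hypothetical mapping: consecutive blocks of words map to one bank."""
    return (addr // words_per_bank) % num_banks


def find_bank_conflicts(packets, num_banks=4, words_per_bank=16):
    """Return {bank: [packets]} for banks targeted by more than one packet."""
    by_bank = {}
    for pkt in packets:
        bank = bank_of(pkt.addr, num_banks, words_per_bank)
        by_bank.setdefault(bank, []).append(pkt)
    return {bank: pkts for bank, pkts in by_bank.items() if len(pkts) > 1}


same_cycle = [
    pack("FROM_CPU", 3, 0xA),
    pack("FROM_NOC", 5, 0xB),
    pack("FROM_NOC", 20, 0xC),
]
conflicts = find_bank_conflicts(same_cycle)
```

Here the first two packets come from different paths but both land in bank 0, so they still need to be serialized.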

@HobbitQia

Collaborator

Got it. I mean the conflict between different paths. Even if we have FROM_CPU and FROM_NOC, there may be concurrent reads/writes on the SPMs. So maybe we need to multiplex the in/out ports of the SPMs, and the logic for multiplexing and handling the conflicts can be implemented in DataSPMController?

@tancheng

tancheng commented Jun 8, 2026

Owner

Oh, we don't need to distinguish FROM_CPU and FROM_NOC, we can decompose the requests from CPU like:

s.recv_from_cpu_pkt_queue.send.rdy @= s.crossbar.recv[kFromCpuCtrlAndDataIdx].rdy

@HobbitQia

HobbitQia commented Jun 8, 2026

Collaborator

But if we add DMA, there should be another path FROM_DMA, which is different from FROM_CPU/FROM_NOC?

@tancheng

tancheng commented Jun 8, 2026

Owner

But if we add DMA, there should be another path FROM_DMA, which is different from FROM_CPU/FROM_NOC?

Ah, yes, makes sense. That FROM_DMA would be similar to FROM_CPU. The DMA_Controller then can assemble the different ports into our struct and send/recv interface.
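
For illustration only: a plain-Python behavioral sketch of a DMA engine assembling separate port values into one struct and sending it over a val/rdy-style interface. `DmaPkt`, `ValRdyQueue`, and `dma_assemble` are hypothetical names and do not reflect the actual DmaEngineRTL or the repo's interface classes.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class DmaPkt:
    """Hypothetical struct bundling the DMA ports into one message."""
    src: str
    addr: int
    data: int


class ValRdyQueue:
    """Behavioral model of a bounded queue behind a val/rdy handshake."""

    def __init__(self, depth):
        self.depth = depth
        self.items = deque()

    @property
    def rdy(self):
        # The receiver is ready only while the queue has free space.
        return len(self.items) < self.depth

    def send(self, val, msg):
        """A transfer happens only when val and rdy are both high."""
        if val and self.rdy:
            self.items.append(msg)
            return True
        return False


def dma_assemble(addr_port, data_port):
    """Combine the individual DMA ports into a single FROM_DMA packet."""
    return DmaPkt("FROM_DMA", addr_port, data_port)


queue = ValRdyQueue(depth=1)
first = queue.send(True, dma_assemble(0x40, 123))
second = queue.send(True, dma_assemble(0x44, 456))  # stalls: queue is full
```

The second send is refused because `rdy` is low, which is how backpressure would reach the DMA engine in this model.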

@HobbitQia

Collaborator

DMA_Controller

So this DMA_Controller refers to DataSPMController or our DMA engine?

@tancheng

tancheng commented Jun 9, 2026

Owner

So this DMA_Controller refers to DataSPMController or our DMA engine?

DMA engine in your figure, or @BenkangPeng's DmaEngineRTL.

@tancheng

Owner

@BenkangPeng will you update this PR accordingly?

@BenkangPeng

Collaborator Author

@BenkangPeng will you update this PR accordingly?

Yes, sorry for the delay. I will update this PR as soon as possible.

Comment thread cgra/CgraTemplateRTL.py Outdated
Comment thread cgra/CgraTemplateRTL.py Outdated
Comment thread controller/ControllerRTL.py Outdated
Comment thread controller/ControllerRTL.py Outdated
Comment thread controller/ControllerRTL.py Outdated
Comment thread controller/ControllerRTL.py Outdated
Comment thread cgra/CgraTemplateRTL.py Outdated
Comment thread cgra/CgraDmaRTL.py Outdated
Comment thread cgra/CgraDmaRTL.py Outdated
Comment thread cgra/IntegratedCgraWithDmaRTL.py
Comment thread lib/basic/val_rdy/ifcs.py Outdated
Comment thread lib/basic/val_rdy/ifcs.py Outdated
Comment thread lib/basic/val_rdy/ifcs.py Outdated
Comment thread lib/util/common.py Outdated
Comment thread lib/messages.py

return mk_bitstruct(new_name, {
'dram_data': DramDataType,
'dram_mask': DramMaskType,

Owner

Could you add a comment explaining what dram_mask is?

Collaborator Author

Added.

Comment thread lib/messages.py
'dram_data': DramDataType,
'dram_mask': DramMaskType,
'spm_data': SpmDataType,
'spm_mask': SpmMaskType,

Owner

Could you add a comment explaining what spm_mask is?

Collaborator Author

Added.
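
For illustration only: the thread does not state how dram_mask and spm_mask are defined, so the sketch below assumes a common byte-enable convention, where bit i of the mask enables byte i of the word (little-endian byte order). The function `apply_byte_mask` is hypothetical and may not match the masks documented in lib/messages.py.

```python
def apply_byte_mask(old_word, new_word, mask, nbytes=4):
    """Merge new_word into old_word under an assumed byte-enable mask.

    Bit i of `mask` set means byte i is taken from new_word;
    otherwise the byte from old_word is kept.
    """
    result = old_word
    for i in range(nbytes):
        if (mask >> i) & 1:
            byte_sel = 0xFF << (8 * i)
            result = (result & ~byte_sel) | (new_word & byte_sel)
    return result


# Bytes 0 and 2 come from the new word; bytes 1 and 3 are preserved.
merged = apply_byte_mask(0xAABBCCDD, 0x11223344, 0b0101)
```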

@tancheng

Owner

We should also include the figure 2 into our

@tancheng

Owner

@BenkangPeng, since you introduced IntegratedDmaWithCgraRTL, we should update this diagram so that the "Controller", "DataMemController", "SPM", and "Control SPM" are wrapped by a CGRA box, and the entire diagram (except the "CPU") is wrapped by an "IntegratedDmaWithCgra" box.

Comment thread controller/ControllerRTL.py
Comment thread mem/dma/DmaEngineRTL.py Outdated
Comment thread mem/data/DataMemControllerRTL.py
…interface for enhanced data transfer capabilities.
…te requests and adjust related signal handling for clarity and consistency.
…TL by passing DmaDataType and DmaCmdType as parameters, and updating related type definitions for improved clarity and consistency.
…rite requests, enhancing type definitions for DmaCmdType and DmaDataType
…dyRecv/SendIfcRTL for improved clarity and consistency in DMA signal handling.
…rite request type and update corresponding tests for consistency
…arity and consistency by renaming signals related to memory requests and responses.
… AST translation limitations and enforce nbytes % 4 check in construct method instead of update block.
…g detailed comments on mask design and data transfer granularity, clarifying the behavior of dram_mask and spm_mask during DMA operations.
…entation regarding the usage of dma_tag in related files.
…between DMA command and CMD_COMPLETE signals in the same clock cycle, with a reference to related discussion.
…tag' instead of 'tag' in DMA-related messages and tests.
# Send the response of reading from SPM to the DMA.
s.send_to_dma_spm_rd_resp = SendIfcRTL(DmaSpmReadRespType)

# Data memory side of the same SPM access path.

Owner

Data memory side of the same SPM access path.

What do you mean by "same SPM"? Also, should we change "Data memory side" to "SRAM data memory"? It is a bit confusing now.


# Data memory side of the same SPM access path.
# Send the request of writing into SPM to the data_mem controller.
s.send_to_mem_spm_wr_req = SendIfcRTL(DmaSpmWriteReqType)

Owner

How is this different from above "send_to_mem_store_request"?

@tancheng tancheng Jun 27, 2026

Owner

IIRC, this is the isolated, newly introduced ifc that is kept separate from the legacy inter-CGRA store_req. We need to refactor their names, e.g.,
s.send_to_mem_store_request -> s.send_to_sram_store_request_from_noc

s.send_to_mem_spm_rd_req -> s.send_to_sram_store_request_from_dma

How does this sound? The load and store_response interfaces can be renamed in the same way. Please also leave a comment that we prefer to keep the two separate because the DMA can perform burst data movement.

3 participants