PCIe transfer performance #126

Draft: amd-vserbu wants to merge 21 commits into Xilinx:dev from amd-vserbu:perf/pcie-transfer-performance

@amd-vserbu

PCIe transfer performance: registered buffers + two-channel transfers

Branch: perf/pcie-transfer-performance · Target: dev · Status: draft

Summary

This PR reworks the QDMA host↔device data path to get much more bandwidth out of bulk transfers. Two changes do most of the work: a new registered-buffer path that pins and DMA-maps a host buffer once and reuses it for many transfers (instead of paying that cost per transfer), and a placement-aware policy that splits each transfer across both of the V80's PCIe NoC channels so both paths stay busy. v80-smi validate gains the bandwidth-benchmark knobs used to measure all of this.
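The difference between the two paths can be illustrated with a small mock. This is a sketch of the lifecycle only, not the driver's ABI: `MockDevice` and all of its method names are hypothetical, and the "pin" step is just a counter standing in for the pin and DMA-map work.

```python
class MockDevice:
    """Hypothetical stand-in for the driver; counts the costly pin+map steps."""

    def __init__(self):
        self.pin_calls = 0        # how many times a buffer was pinned and mapped
        self.registered = {}      # handle -> buffer
        self._next_handle = 1

    def register(self, buf):
        """Pin and map a buffer once, returning a reusable handle."""
        self.pin_calls += 1
        handle = self._next_handle
        self._next_handle += 1
        self.registered[handle] = buf
        return handle

    def unregister(self, handle):
        self.registered.pop(handle, None)

    def transfer_unregistered(self, buf):
        """Per-transfer path: pin+map, move the data, then unmap again."""
        handle = self.register(buf)
        self.unregister(handle)

    def transfer_registered(self, handle):
        """Fast path: the buffer is already pinned, so no pin cost is paid."""
        if handle not in self.registered:
            raise KeyError(f"unknown buffer handle {handle}")

    def close(self):
        """Auto-cleanup on close: drop every registration still alive."""
        self.registered.clear()
```

With 100 transfers of the same buffer, the per-transfer path in this mock pins 100 times while the registered path pins once, which is the cost the new path is meant to remove.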

What changed

  • Driver: new registered-buffer ioctls (register once, transfer many times, auto-cleanup on close), plus per-queue NoC channel selection so the two PCIe masters can be A/B tested.
  • Channel policy: a transfer is split across both channels based on the buffer's device address, so both the host-side ingress and the memory-side egress paths are driven (DDR splits in half; HBM routes by its half-memory boundary).
  • vrtd/libvrtd: buffer open now hands back two queue-pair fds (one per channel) and the client decides how to spread each transfer across them; host buffers use hugepages.
  • v80-smi: validate gains bandwidth modes over two backends (raw SLASH and the stock Xilinx QDMA driver) reporting Read/Write/Total for HBM and DDR, with knobs for channel selection, ring size, iteration/duration, and buffer placement.
  • Docs & packaging: documented the new ABI and validate options; Debian/RPM ship the local libqdma patches.
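The channel policy described above could be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: the function names, the channel numbering (0 for the lower half, 1 for the upper half), and the `hbm_base`/`hbm_size` parameters are all hypothetical, and alignment of the split point is ignored.

```python
from typing import List, Tuple

# (channel, device_address, length_in_bytes)
Segment = Tuple[int, int, int]


def split_ddr(dev_addr: int, length: int) -> List[Segment]:
    """Split a DDR transfer in half, one half per channel."""
    first = length // 2
    segments = [
        (0, dev_addr, first),
        (1, dev_addr + first, length - first),
    ]
    # Drop empty pieces (e.g. a zero- or one-byte transfer).
    return [s for s in segments if s[2] > 0]


def split_hbm(dev_addr: int, length: int,
              hbm_base: int, hbm_size: int) -> List[Segment]:
    """Route an HBM transfer by the half-memory boundary.

    Bytes below the midpoint of the HBM range go to channel 0, bytes at or
    above it go to channel 1 (assumed mapping). A transfer that straddles
    the boundary is cut at the boundary.
    """
    boundary = hbm_base + hbm_size // 2
    end = dev_addr + length
    segments = []
    if dev_addr < boundary:
        segments.append((0, dev_addr, min(end, boundary) - dev_addr))
    if end > boundary:
        start = max(dev_addr, boundary)
        segments.append((1, start, end - start))
    return [s for s in segments if s[2] > 0]
```

In this sketch a DDR transfer always yields two segments, while an HBM transfer yields two only when it crosses the midpoint of the HBM range.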

Earlier commits experimented with large-page transfers and custom libqdma scatter-gather/channel patches; those were dropped. The final path is 4 KiB-only, and the speedup comes from the registered-buffer fast path and the two-channel split.

Results

Sustained one-directional bandwidth with the registered-buffer path and the two-channel split:

| Path | C2H (device→host) | H2C (host→device) |
|------|-------------------|-------------------|
| DDR  | ~23 GB/s          | ~23 GB/s          |
| HBM  | ~12 GB/s          | ~23 GB/s          |

  • 20 GB/s+ is reached on DDR in sustained mode with a single buffer, 2 threads, and 2 queues, as long as the buffer is large enough (64 MB is sufficient).
  • 2 MB pages were tried but discarded: they were slower than 4 KiB pages at all sizes.
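For a sense of scale, the arithmetic behind these figures can be sketched as below. The sketch assumes binary units for the buffer (64 MiB) and decimal GB/s for bandwidth. The fixed per-transfer overhead in `effective_gbps` is a hypothetical modelling parameter, not something measured in this PR; it only shows one way a small buffer could fall short of the peak rate.

```python
PAGE_4K = 4 * 1024
PAGE_2M = 2 * 1024 * 1024
BUF_64M = 64 * 1024 * 1024   # assumes "64 MB" means 64 MiB


def pages(buf_bytes: int, page_bytes: int = PAGE_4K) -> int:
    """Number of pages needed to cover the buffer (rounded up)."""
    return -(-buf_bytes // page_bytes)


def transfer_time_s(buf_bytes: int, gb_per_s: float) -> float:
    """Time to move the buffer once at the given rate (1 GB = 1e9 bytes)."""
    return buf_bytes / (gb_per_s * 1e9)


def effective_gbps(buf_bytes: int, peak_gb_per_s: float,
                   overhead_s: float) -> float:
    """Rate seen by the caller if each transfer pays a fixed overhead.

    overhead_s is a hypothetical per-transfer cost (setup, completion).
    """
    total = transfer_time_s(buf_bytes, peak_gb_per_s) + overhead_s
    return buf_bytes / total / 1e9
```

Under these assumptions a 64 MiB buffer spans 16384 pages of 4 KiB and takes roughly 2.9 ms to move at 23 GB/s, so any fixed per-transfer cost is amortised over far more data than with a buffer of a few megabytes.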

Still to do

  • Running read and write at the same time almost halves throughput; needs investigation into whether the path is full-duplex.
  • Running HBM and DDR H2C together still caps at ~23 GB/s, with no gain over a single memory (an improvement was expected).
  • HBM C2H remains slow (~12 GB/s) compared to everything else.

Why this is still a draft

The transfer API is not in its final shape. Today it requires userspace to use threads to keep multiple transfers in flight. Before merging, we want to move back to plain read/write calls so multiple transfers can be submitted at once without threading.
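The threading requirement looks roughly like this from the client side. This is a minimal sketch, not the libvrtd API: `two_channel_write` and `write_all` are hypothetical helpers, the two descriptors are assumed to be ordinary fds that accept `os.pwrite`, and the file offset stands in for the device address.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def write_all(fd: int, data: bytes, offset: int) -> int:
    """Write the whole chunk at the given offset, retrying on short writes."""
    view = memoryview(data)
    done = 0
    while done < len(view):
        done += os.pwrite(fd, view[done:], offset + done)
    return done


def two_channel_write(fds, data: bytes) -> int:
    """Keep two transfers in flight by giving each fd its own thread.

    fds is a pair of descriptors, one per channel (assumed). The buffer is
    split in half and each half is written by a separate worker.
    """
    half = len(data) // 2
    parts = [(fds[0], data[:half], 0), (fds[1], data[half:], half)]
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(write_all, fd, chunk, off)
                   for fd, chunk, off in parts]
        return sum(f.result() for f in futures)
```

Trying this without hardware is possible by opening the same temporary file twice and passing both descriptors as `fds`. An API that accepts several submissions per call would make the thread pool unnecessary.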

…for RHEL 9.8

Signed-off-by: Vlad-Gabriel Serbu <Vlad-Gabriel.Serbu@amd.com>
…e /tmp

Signed-off-by: Vlad-Gabriel Serbu <Vlad-Gabriel.Serbu@amd.com>
…scriptor size

Signed-off-by: Vlad-Gabriel Serbu <Vlad-Gabriel.Serbu@amd.com>
… knobs

Signed-off-by: Vlad-Gabriel Serbu <Vlad-Gabriel.Serbu@amd.com>
…ansfers

Signed-off-by: Vlad-Gabriel Serbu <Vlad-Gabriel.Serbu@amd.com>
…olicy

Signed-off-by: Vlad-Gabriel Serbu <Vlad-Gabriel.Serbu@amd.com>
…-only

Signed-off-by: Vlad-Gabriel Serbu <Vlad-Gabriel.Serbu@amd.com>