Pcie transfer performance by amd-vserbu · Pull Request #126 · Xilinx/SLASH

amd-vserbu · 2026-06-15T16:04:14Z

PCIe transfer performance: registered buffers + two-channel transfers

Branch: perf/pcie-transfer-performance → target: dev · Status: draft

Summary

This PR reworks the QDMA host↔device data path to raise bulk-transfer bandwidth (roughly 23 GB/s sustained on most paths; see Results). Two changes do most of the work: a new registered-buffer path that pins and DMA-maps a host buffer once and reuses it for many transfers (instead of paying that cost per transfer), and a placement-aware policy that splits each transfer across both of the V80's PCIe NoC channels so both paths stay busy. v80-smi validate gains the bandwidth-benchmark knobs used to measure all of this.
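To make the "register once, transfer many times" argument concrete, here is a toy cost model in Python. It is only a sketch: the class, the function names, and the microsecond figures are invented for illustration and are not measurements or part of the driver ABI. It shows why the fixed pin-and-map cost stops mattering once a buffer is reused.

```python
from dataclasses import dataclass


@dataclass
class CostModel:
    """Toy per-operation costs in microseconds (made-up numbers, not measurements)."""
    pin_and_map_us: float = 200.0  # pin the host pages and build the DMA mappings
    submit_us: float = 5.0         # queue the descriptors for one transfer


def per_transfer_path(n_transfers: int, cost: CostModel) -> float:
    """Old behaviour as described above: every transfer pays pin + map + submit."""
    return n_transfers * (cost.pin_and_map_us + cost.submit_us)


def registered_path(n_transfers: int, cost: CostModel) -> float:
    """Registered buffer: pin and map once up front, then only submit."""
    return cost.pin_and_map_us + n_transfers * cost.submit_us
```

With these placeholder costs, a single transfer costs the same on both paths, while 100 transfers cost 20500 µs on the per-transfer path and 700 µs on the registered path. The real ratio depends on buffer size and hardware.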

What changed

Driver: new registered-buffer ioctls (register once, transfer many times, auto-cleanup on close), plus per-queue NoC channel selection so the two PCIe masters can be A/B tested.
Channel policy: a transfer is split across both channels based on the buffer's device address, so both the host-side ingress and the memory-side egress paths are driven (DDR splits in half; HBM routes by its half-memory boundary).
vrtd/libvrtd: buffer open now hands back two queue-pair fds (one per channel) and the client decides how to spread each transfer across them; host buffers use hugepages.
v80-smi: validate gains bandwidth modes over two backends (raw SLASH and the stock Xilinx QDMA driver) reporting Read/Write/Total for HBM and DDR, with knobs for channel selection, ring size, iteration/duration, and buffer placement.
Docs & packaging: documented the new ABI and validate options; Debian/RPM ship the local libqdma patches.
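The channel policy described in the list above can be modelled in a few lines of Python. This is a sketch of one possible reading, not the driver code: the `Chunk` type, both function names, the channel numbering, the page-aligned split point, and the `hbm_base`/`hbm_size` parameters are all assumptions. It reads "DDR splits in half" as cutting the transfer's byte range in two, and "HBM routes by its half-memory boundary" as sending everything below the midpoint of the HBM address range to one channel and everything above it to the other.

```python
from typing import List, NamedTuple

PAGE = 4096  # the final path is 4 KiB-only; aligning the split to it is an assumption


class Chunk(NamedTuple):
    channel: int   # 0 or 1; which channel serves which half is hypothetical
    dev_addr: int  # device address where this piece starts
    length: int    # bytes in this piece


def split_ddr(dev_addr: int, length: int) -> List[Chunk]:
    """Assumed DDR rule: cut the transfer in half, one half per channel."""
    if length <= 0:
        return []
    half_rounded = (length // 2 + PAGE - 1) // PAGE * PAGE  # round up to a page
    first = min(length, max(PAGE, half_rounded))
    chunks = [Chunk(0, dev_addr, first)]
    if first < length:
        chunks.append(Chunk(1, dev_addr + first, length - first))
    return chunks


def split_hbm(dev_addr: int, length: int, hbm_base: int, hbm_size: int) -> List[Chunk]:
    """Assumed HBM rule: route by which half of the HBM range the bytes land in."""
    if length <= 0:
        return []
    boundary = hbm_base + hbm_size // 2
    end = dev_addr + length
    chunks = []
    if dev_addr < boundary:  # part (or all) of the range is in the lower half
        chunks.append(Chunk(0, dev_addr, min(end, boundary) - dev_addr))
    if end > boundary:       # part (or all) of the range is in the upper half
        start = max(dev_addr, boundary)
        chunks.append(Chunk(1, start, end - start))
    return chunks
```

Under these assumptions a 64 MiB DDR transfer yields two 32 MiB chunks, one per channel, while an HBM transfer that sits entirely in one half of the memory stays on a single channel.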

Earlier commits experimented with large-page transfers and custom libqdma scatter-gather/channel patches; those were dropped. The final path is 4 KiB-only, and the speedup comes from the registered-buffer fast path and the two-channel split.

Results

Sustained one-directional bandwidth with the registered-buffer path and the two-channel split:

Memory	C2H (device→host)	H2C (host→device)
DDR	~23 GB/s	~23 GB/s
HBM	~12 GB/s	~23 GB/s

More than 20 GB/s is reached on DDR with a single buffer, 2 threads, and 2 queues in sustained mode, as long as the buffer is large enough (64 MB is sufficient).
2 MB pages were tried but discarded: they were slower than 4 KiB pages at all sizes.
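For readers who want to reproduce numbers like these, the following Python sketch shows one way a sustained-bandwidth figure can be computed. It assumes that "sustained mode" means issuing transfers back to back for a fixed duration and dividing total bytes by elapsed time; that reading, the function name, and the `transfer` callable are assumptions and not the v80-smi implementation.

```python
import time
from typing import Callable


def measure_sustained(
    transfer: Callable[[], int],
    duration_s: float,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Call `transfer` back to back for at least `duration_s` seconds.

    `transfer` is a stand-in for one blocking transfer and must return the
    number of bytes it moved. The result is in decimal GB/s (1e9 bytes/s).
    """
    start = clock()
    total_bytes = 0
    while True:
        total_bytes += transfer()
        elapsed = clock() - start
        if elapsed >= duration_s:
            return total_bytes / elapsed / 1e9
```

The clock is injectable so the loop can be checked without real hardware. In a real run, `transfer` would wrap whichever blocking transfer call is being benchmarked.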

Still to do

Running reads and writes at the same time almost halves throughput; we need to investigate whether the path is actually full-duplex.
Running HBM and DDR H2C together still caps at ~23 GB/s, with no gain over a single memory (an improvement was expected).
HBM C2H remains slow (~12 GB/s) compared to everything else.

Why this is still a draft

The transfer API is not in its final shape. Today it requires userspace to use threads to keep multiple transfers in flight. Before merging, we want to move back to plain read/write calls so multiple transfers can be submitted at once without threading.
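The threading requirement can be illustrated with a small Python sketch. It assumes a blocking per-chunk transfer call, represented here by the hypothetical `transfer_one(channel, offset, length)` callable, and uses one worker thread per chunk so that both channels have a transfer in flight at once. It is a model of the usage pattern, not the libvrtd API.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Tuple

# (channel, offset, length); the layout of this tuple is an assumption
ChunkSpec = Tuple[int, int, int]


def transfer_in_flight(
    chunks: Iterable[ChunkSpec],
    transfer_one: Callable[[int, int, int], int],
    max_workers: int = 2,
) -> int:
    """Run each chunk's blocking transfer on its own worker thread.

    `transfer_one` is a hypothetical stand-in for the blocking transfer call
    and must return the number of bytes moved. Returns the total bytes moved.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return sum(pool.map(lambda c: transfer_one(*c), chunks))
```

With two chunks and two workers this mirrors the "2 threads, 2 queues" setup from the results. An API that accepts several submissions per call would remove the need for the thread pool on the client side.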

amd-vserbu added 21 commits June 15, 2026 13:42

Added slash_compat.h force include and fix from_timer backport issue for RHEL 9.8

2b26c0a

Signed-off-by: Vlad-Gabriel Serbu <Vlad-Gabriel.Serbu@amd.com>

driver: large-page qdma transfers, NoC channel steering, libqdma patches

52e1535

Signed-off-by: Vlad-Gabriel Serbu <Vlad-Gabriel.Serbu@amd.com>

libvrtd: hugepage host buffers and granule-aware partial sync

139207e

Signed-off-by: Vlad-Gabriel Serbu <Vlad-Gabriel.Serbu@amd.com>

smi: validate bandwidth modes with raw SLASH and stock qdma backends

5c71053

Signed-off-by: Vlad-Gabriel Serbu <Vlad-Gabriel.Serbu@amd.com>

packaging: ship libqdma patches, add deps, harden install test, ignore /tmp

eb1ef82

Signed-off-by: Vlad-Gabriel Serbu <Vlad-Gabriel.Serbu@amd.com>

driver: guard libqdma pr_fmt under force-included compat header

6cf73e8

Signed-off-by: Vlad-Gabriel Serbu <Vlad-Gabriel.Serbu@amd.com>

driver: verify qdma host-profile readback and add tunable hugepage descriptor size

81b67b6

Signed-off-by: Vlad-Gabriel Serbu <Vlad-Gabriel.Serbu@amd.com>

smi: add validate buffer placement, channel allocation, and bandwidth knobs

cc99940

Signed-off-by: Vlad-Gabriel Serbu <Vlad-Gabriel.Serbu@amd.com>

docs: document new validate placement and bandwidth options

f9abff3

Signed-off-by: Vlad-Gabriel Serbu <Vlad-Gabriel.Serbu@amd.com>

Added 4k|2M explicit page specification

356bbd3

Signed-off-by: Vlad-Gabriel Serbu <Vlad-Gabriel.Serbu@amd.com>

driver: add qdma registered-buffer abi with pinned, pre-dma-mapped transfers

96ae870

Signed-off-by: Vlad-Gabriel Serbu <Vlad-Gabriel.Serbu@amd.com>

libslash: add qdma buffer register/unregister/transfer wrappers and mock

70046dc

Signed-off-by: Vlad-Gabriel Serbu <Vlad-Gabriel.Serbu@amd.com>

tests: cover qdma registered-buffer kselftest abi

64a16c9

Signed-off-by: Vlad-Gabriel Serbu <Vlad-Gabriel.Serbu@amd.com>

docs: document qdma registered-buffer abi

c45ceb4

Signed-off-by: Vlad-Gabriel Serbu <Vlad-Gabriel.Serbu@amd.com>

smi: add --ring-size-index and use registered buffers for raw transfers

9188e32

Signed-off-by: Vlad-Gabriel Serbu <Vlad-Gabriel.Serbu@amd.com>

docs: document validate --ring-size-index option

23ce484

Signed-off-by: Vlad-Gabriel Serbu <Vlad-Gabriel.Serbu@amd.com>

vrtd: plumb mm-channel selection through buffer open

d111d40

Signed-off-by: Vlad-Gabriel Serbu <Vlad-Gabriel.Serbu@amd.com>

driver+libslash: added transfer performance hint

8d1f4df

Signed-off-by: Vlad-Gabriel Serbu <Vlad-Gabriel.Serbu@amd.com>

vrt/vrtd: use new performance buffer ioctl api

5208e0a

Signed-off-by: Vlad-Gabriel Serbu <Vlad-Gabriel.Serbu@amd.com>

qdma stack: change policy from dual-channel to the more complex v80 policy

7664f5f

Signed-off-by: Vlad-Gabriel Serbu <Vlad-Gabriel.Serbu@amd.com>

driver+vrt+smi: drop libqdma sg/channel patches, make transfers 4 KiB-only

4878ca9

Signed-off-by: Vlad-Gabriel Serbu <Vlad-Gabriel.Serbu@amd.com>

amd-vserbu wants to merge 21 commits into Xilinx:dev from amd-vserbu:perf/pcie-transfer-performance
