Performance Test of GPU Direct RDMA Communication Between an FPGA and the GPU on the Jetson Platform

Created on: 2026-07-05 15:34

1 Technical Principles

1.1 What is GPU Direct RDMA

GPU Direct RDMA (Remote Direct Memory Access) is a data transfer technology from NVIDIA that lets third-party PCIe devices (such as FPGAs, network cards, and video capture cards) exchange data directly with GPU memory over the PCIe bus, without involving the CPU or staging the data in a system memory buffer.

Traditional data path (FPGA → GPU):

FPGA ──DMA──→ CPU Memory ──cudaMemcpy──→ GPU Video Memory
        ↑                      ↑
PCIe Transfer (1st time)  Memory Bus Copy (2nd time)

GPU Direct RDMA path (FPGA → GPU):

FPGA ──PCIe DMA──→ GPU Video Memory
         ↑
    Only one PCIe transfer, zero CPU copy

1.2 Core Advantages

Advantage	Description
Reduced Latency	Eliminates the intermediate CPU copy, reducing end-to-end latency by approximately 50% relative to the FPGA→CPU→GPU path
Bandwidth Improvement	Avoids memory bus contention, increasing effective bandwidth by 1.5x~2.5x
CPU Offloading	The CPU does not move data during the DMA transfer and is free to handle other tasks
Zero Copy	Data is transmitted directly to the GPU without transit buffering in system memory
Deterministic Latency	No CPU scheduling interference in the data path, suitable for real-time systems
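
The "approximately 50%" latency figure and the speedup ratios reported later in this article describe the same thing in two ways. The short sketch below converts one into the other; the function name is illustrative, and the sample ratios are the 12-hour averages from Section 4.4.

```python
def latency_reduction_pct(speedup: float) -> float:
    """Percent of latency saved when a transfer becomes `speedup` times faster."""
    return (1.0 - 1.0 / speedup) * 100.0

# Speedup ratios taken from the 12-hour test in Section 4.4.
cases = [
    ("Read, vs full path", 1.97),
    ("Write, vs full path", 2.49),
    ("Read, vs CPU DMA only", 1.52),
]
for label, ratio in cases:
    print(f"{label}: {ratio:.2f}x -> {latency_reduction_pct(ratio):.1f}% lower latency")
```

A 1.97x speedup therefore corresponds to about 49% lower latency, while the 1.5x gain over CPU-only DMA corresponds to about 34%.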

1.3 Adapted Devices and Platforms

Platform	Support Status	GPU Type	Remarks
NVIDIA Jetson Orin (Tegra)	✅ Verified	Integrated GPU (Unified Memory)	Via nvidia-p2p kernel interface
NVIDIA Jetson Xavier	✅ Adaptable	Integrated GPU	Same integrated-GPU (Tegra) unified-memory design as Orin
x86 + NVIDIA Discrete GPU	✅ Adaptable	Tesla/Quadro	Requires the NVIDIA driver's nvidia_p2p_* kernel API (nvidia-peermem targets InfiniBand NICs)
FPGA (Xilinx Kintex/Artix)	✅ Verified	N/A	Via XDMA IP core + custom driver
Other PCIe DMA Devices	✅ Extensible	-	Driver must implement the Pin/Unpin/Transfer ioctls

1.4 Workflow

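GPU buffer allocation: cudaHostAlloc (required on Jetson) or cudaMalloc (discrete GPUs)
Pin operation: Map the GPU virtual address to physical pages and lock them in memory
DMA transfer: The FPGA reads/writes the pinned GPU physical pages directly over PCIe (through the GPU BAR on discrete GPUs; on Jetson the pinned pages reside in the shared DRAM)
Unpin operation: Release the page lock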

Key point: The Pin operation only needs to be executed once. The same Handle can then be reused indefinitely for subsequent DMA transfers, which avoids the per-transfer address translation overhead of the traditional method.
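
The pin-once, transfer-many lifecycle can be sketched as follows. This is only an illustration of the call pattern: `FakeDevice` and its `pin`/`transfer`/`unpin` methods are hypothetical stand-ins that count calls, not the HelloFPGA API or the real ioctl interface.

```python
from contextlib import contextmanager

class FakeDevice:
    """Hypothetical stand-in for the driver; it only counts calls."""
    def __init__(self):
        self.pin_calls = 0
        self.unpin_calls = 0
        self.transfers = 0

    def pin(self, buffer):
        self.pin_calls += 1
        return ("handle", id(buffer))      # opaque handle reused by every transfer

    def transfer(self, handle, fpga_addr, length):
        self.transfers += 1                # a real driver would start the DMA here

    def unpin(self, handle):
        self.unpin_calls += 1

@contextmanager
def pinned(device, buffer):
    """Pin once on entry, always unpin on exit, even if a transfer raises."""
    handle = device.pin(buffer)
    try:
        yield handle
    finally:
        device.unpin(handle)

dev = FakeDevice()
buf = bytearray(4096)
with pinned(dev, buf) as handle:
    for _ in range(100):                   # the loop contains transfers only
        dev.transfer(handle, fpga_addr=0, length=len(buf))

print(dev.pin_calls, dev.transfers, dev.unpin_calls)
```

Wrapping the handle in a context manager is one way to keep Pin and Unpin outside the transfer loop while still guaranteeing the release.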

2 Test Environment

2.1 Hardware Platform

The test uses Xingce Electronics JetKU hardware: the FPGA and the Jetson NX module are connected over PCIe Gen3 x4, the FPGA has 2GB of DDR4 memory, and the DMA engine is the Xilinx XDMA IP.

Component	Specification
Embedded Platform	NVIDIA Jetson Orin (aarch64)
GPU	Orin Integrated Ampere GPU, Unified Memory Architecture
FPGA	Xilinx Series, PCIe Gen3 x4
FPGA-side Memory	DDR4 2GB
PCIe Link	Gen3 x4 (Theoretical peak ~4GB/s)
System Memory	LPDDR5 (Unified Memory)
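
The "~4GB/s" theoretical peak in the table can be reproduced from the PCIe Gen3 line rate. The sketch below assumes the standard Gen3 figures (8 GT/s per lane, 128b/130b encoding) and ignores packet and protocol overhead, so achievable payload throughput is lower. The function name is illustrative.

```python
def pcie_gen3_peak_gbps(lanes: int) -> float:
    """Raw per-direction PCIe Gen3 bandwidth in GB/s (decimal), before protocol overhead."""
    transfers_per_second = 8e9       # 8 GT/s per lane
    encoding_efficiency = 128 / 130  # 128b/130b line encoding
    bits_per_second = transfers_per_second * encoding_efficiency * lanes
    return bits_per_second / 8 / 1e9

print(f"Gen3 x4 raw peak: {pcie_gen3_peak_gbps(4):.2f} GB/s")
```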

2.2 Software Environment

Component	Version
OS	Ubuntu 22.04 (aarch64)
CUDA	12.6
Kernel Driver	HelloFPGA XDMA Custom Driver (v2020.2.2) + GPU Direct Extension
User-space Library	libHelloFPGACore.so (including GPU Direct compatibility layer)
Compiler	nvcc (CUDA 12.6) + GCC

2.3 Driver Architecture

User Space:  libHelloFPGACore.so (TransferMode API)
               │
               ├─ CPU Mode: open(/dev/HelloFPGA0_c2h_*) → read/write
               │
               └─ GPU Direct Mode: open → ioctl(XDMA_IOC_GPU_PIN/XFER/UNPIN)
                        │
Kernel Space:  HelloFPGA.ko (XDMA + xdma_gpu_direct + xdma_gpu_tegra)
                        │
Hardware:      FPGA XDMA IP ←──PCIe──→ GPU BAR (Physical Address Direct Access)

3 Test Methods

3.1 Test Tools

Test program: gpu_direct_api_test.cu

Supports two running modes:

Quick functional test: Verifies API correctness and compares performance (about 2 minutes)
Long-term steady-state stress test: Runs continuously for 12 hours and records data every 10 minutes (--long parameter)

3.2 Test Comparison Scheme

Path ID	Scheme Name	Data Flow	Description
[A]	CPU DMA Only	FPGA → CPU Memory	Traditional DMA, data stays on CPU side
[B]	FPGA→CPU→GPU Full Path	FPGA → CPU → GPU	Complete path for delivering data to GPU in traditional way
[C]	GPU Direct Handle	FPGA → GPU Direct	Pre-Pin + DMA direct transfer, no CPU transit

3.3 Test Items

Test Item	Content
API Function Verification	GetStatus / Pin / ReadC2H / WriteH2C / Unpin
Data Correctness	Write pattern → Read back → Byte-by-byte comparison
Multi-size Performance	Full coverage of 6 sizes from 4KB to 8MB
2GB Address Space	Traverse the entire 0~2GB range of FPGA to verify no address dead spots
Multi-buffer Rotation	4-frame GPU buffer cyclic acquisition, compared with single buffer
TransferMode Compatibility	Zero modification to old interfaces, internal automatic routing to GPU Direct
12-hour Stability	72 samples, full recording of power consumption / performance / jitter
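
The data correctness item (write pattern, read back, byte-by-byte comparison) can be sketched as below. This is a minimal illustration and not the logic of gpu_direct_api_test.cu: the incrementing pattern and both helper names are assumptions, and the read-back buffer is simulated in memory instead of coming from the FPGA.

```python
def make_pattern(length: int, seed: int = 0) -> bytes:
    """Incrementing byte pattern; a shifted or dropped byte changes every later position."""
    return bytes((seed + i) & 0xFF for i in range(length))

def first_mismatch(expected: bytes, actual: bytes):
    """Return the offset of the first differing byte, or None if the buffers match."""
    if len(expected) != len(actual):
        return min(len(expected), len(actual))
    for offset, (e, a) in enumerate(zip(expected, actual)):
        if e != a:
            return offset
    return None

written = make_pattern(4096, seed=0x5A)
read_back = bytearray(written)       # stand-in for data read back from the FPGA
print(first_mismatch(written, bytes(read_back)))

read_back[1234] ^= 0xFF              # inject a single corrupted byte
print(first_mismatch(written, bytes(read_back)))
```

Reporting the first mismatching offset, instead of a plain pass/fail, makes it easier to tell an alignment problem from a corrupted payload.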

3.4 Key Parameters

Transfer block size: 4MB (long test) / 4KB~8MB (quick test)
FPGA address step: 64MB (traverse 0~2GB, 32 test points / round)
Sampling frequency: every 10 minutes (long test)
Iterations per sampling point: 5 (results averaged)
Power consumption collection: INA3221 sensor (VDD_IN channel)
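
The address sweep follows directly from the step size. The sketch below lists the FPGA start addresses for one round, assuming binary units (1MB = 1024 × 1024 bytes); the function name is illustrative.

```python
MB = 1024 * 1024

def sweep_addresses(total_bytes: int = 2048 * MB, step_bytes: int = 64 * MB):
    """Start addresses visited in one round of the 0~2GB traversal."""
    return list(range(0, total_bytes, step_bytes))

addresses = sweep_addresses()
block = 4 * MB                                   # long-test transfer block size
print(len(addresses), "points, last start =", addresses[-1] // MB, "MB")
print("last block fits:", addresses[-1] + block <= 2048 * MB)
```

With a 64MB step the last start address is 1984MB, and a 4MB block starting there still ends inside the 2GB range.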

4 Test Results

4.1 Multi-size Performance Comparison (Quick Test)

Latency comparison (μs, lower is better), table format: Read / Write

Data Size	CPU DMA Only	FPGA→CPU→GPU Full Path	GPU Direct	Speedup Ratio (vs CPU)	Speedup Ratio (vs Full Path)
4KB	82.6 / 58.4	232.8 / 86.2	54.4 / 45.4	1.52x / 1.29x	4.28x / 1.90x
64KB	87.8 / 80.4	122.4 / 195.2	70.2 / 74.4	1.25x / 1.08x	1.74x / 2.62x
512KB	341.8 / 301.6	635.2 / 740.2	292.0 / 243.4	1.17x / 1.24x	2.18x / 3.04x
1MB	638.4 / 574.6	1016.0 / 1219.2	521.2 / 412.6	1.22x / 1.39x	1.95x / 2.95x
4MB	2398.0 / 2102.8	3417.6 / 3522.2	1773.6 / 1487.0	1.35x / 1.41x	1.93x / 2.37x
8MB	4888.2 / 4250.0	6564.8 / 6507.6	3557.4 / 2951.2	1.37x / 1.44x	1.85x / 2.21x
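
Each speedup ratio in the table is the baseline latency divided by the GPU Direct latency for the same size and direction. A small sketch, using the 4KB row as input (the function name is illustrative):

```python
def speedup(baseline_us: float, gpu_direct_us: float) -> float:
    """How many times faster GPU Direct is than the baseline, rounded as in the table."""
    return round(baseline_us / gpu_direct_us, 2)

# 4KB row: (Read, Write) latencies in microseconds.
cpu_dma = (82.6, 58.4)
full_path = (232.8, 86.2)
gpu_direct = (54.4, 45.4)

print("vs CPU DMA:  ", [speedup(b, g) for b, g in zip(cpu_dma, gpu_direct)])
print("vs full path:", [speedup(b, g) for b, g in zip(full_path, gpu_direct)])
```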

Bandwidth comparison (GB/s, higher is better), table format: Read / Write

Data Size	CPU DMA Only	FPGA→CPU→GPU	GPU Direct
512KB	1.53 / 1.74	0.83 / 0.71	1.80 / 2.15
1MB	1.64 / 1.82	1.03 / 0.86	2.01 / 2.54
4MB	1.75 / 1.99	1.23 / 1.19	2.36 / 2.82
8MB	1.72 / 1.97	1.28 / 1.29	2.36 / 2.84

GPU Direct peak bandwidth reaches 2.84 GB/s (8MB Write), roughly 70% of the ~4GB/s theoretical peak of PCIe Gen3 x4
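
The bandwidth figures follow from the latency table. The values match when the transfer size is taken in binary units (1MB = 1024 × 1024 bytes) and the result is expressed in decimal GB/s (10^9 bytes per second); that unit convention is inferred from the numbers, not stated by the test program.

```python
MB = 1024 * 1024

def bandwidth_gbps(size_bytes: int, latency_us: float) -> float:
    """Effective bandwidth in decimal GB/s for one transfer."""
    return size_bytes / (latency_us * 1e-6) / 1e9

# GPU Direct latencies (microseconds) from the latency table.
print(f"8MB Write: {bandwidth_gbps(8 * MB, 2951.2):.2f} GB/s")
print(f"4MB Read:  {bandwidth_gbps(4 * MB, 1773.6):.2f} GB/s")
```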

4.2 TransferMode Compatibility Mode Performance

User code is unchanged (it still calls HelloFPGA_DMA_MM_*); the mode is switched with only 2 lines of configuration. Unit: μs; Format: Read / Write

Data Size	CPU Mode	GPU_PINNED Mode	Speedup Ratio
4KB	79.6 / 65.4	57.8 / 51.8	1.38x / 1.26x
64KB	140.4 / 123.2	69.8 / 75.0	2.01x / 1.64x
256KB	198.2 / 210.6	141.4 / 155.2	1.40x / 1.36x
1MB	607.0 / 677.8	417.2 / 487.6	1.45x / 1.39x
4MB	2183.2 / 2516.2	1491.8 / 1781.4	1.46x / 1.41x
8MB	4279.4 / 4977.4	2954.4 / 3541.0	1.45x / 1.41x

4.3 Multi-buffer vs Single-buffer Comparison

Simulates a real image acquisition scenario: rotation across 4 GPU buffers vs repeated read/write of a single buffer (100 iterations, 1MB/frame)

Indicator	Multi-buffer (4-frame rotation)	Single-buffer	Difference
Average Latency	501.7 μs	491.9 μs	-
Minimum Latency	460.0 μs	455.0 μs	-
Maximum Latency	1124.0 μs	660.0 μs	-
Average Bandwidth	2.09 GB/s	2.13 GB/s	-
Equivalent Frame Rate	1993 fps	2033 fps	0.98x

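Conclusion: The overhead of multi-buffer address lookup and matching is small: average latency and frame rate differ by about 2% (0.98x). Note that the maximum latency observed with multi-buffer rotation (1124.0 μs) is higher than with a single buffer (660.0 μs).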
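
The equivalent frame rate is the reciprocal of the average per-frame latency. A short sketch that reproduces the table values (the function name is illustrative):

```python
def frames_per_second(avg_latency_us: float) -> float:
    """Frames per second if each frame takes avg_latency_us to transfer back to back."""
    return 1e6 / avg_latency_us

multi = frames_per_second(501.7)    # 4-frame rotation
single = frames_per_second(491.9)   # single buffer

print(f"multi-buffer:  {multi:.0f} fps")
print(f"single-buffer: {single:.0f} fps")
print(f"ratio:         {multi / single:.2f}x")
```

This assumes transfers run back to back with no processing time between frames, so it is an upper bound on the acquisition rate.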

4.4 12-hour Long-term Steady-state Test

Test duration: 12.00 hours; Number of sampling points: 72 times (every 10 minutes); Transfer block size: 4MB; Address traversal: 0 ~ 2GB (completed 2 full rounds of traversal)

Performance Statistics

Indicator	CPU DMA Only	FPGA→CPU→GPU	GPU Direct
Average Read (μs)	2711.9	3518.3	1785.1
Average Write (μs)	2276.2	3701.2	1489.0
Average Bandwidth Read	1.55 GB/s	1.19 GB/s	2.35 GB/s
Average Bandwidth Write	1.84 GB/s	1.13 GB/s	2.82 GB/s

Speedup Ratio (GPU Direct vs Each Baseline)

Comparison Baseline	Read	Write
GPU Direct vs CPU DMA	1.52x	1.53x
GPU Direct vs Traditional Full Path	1.97x	2.49x

GPU Direct Performance Stability

Indicator	GPU Direct Read	GPU Direct Write
Minimum Latency	1749.6 μs	1461.2 μs
Maximum Latency	2008.0 μs	1554.6 μs
Jitter (max-min)	258.4 μs	93.4 μs
Relative Fluctuation	±7.2%	±3.1%
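
The jitter and relative fluctuation rows can be recomputed from the minimum, maximum, and average latency. The ± figures are consistent with half of the max-min range divided by the 12-hour average latency; that definition is inferred from the numbers and is an assumption, as is the function name.

```python
def jitter_stats(min_us: float, max_us: float, mean_us: float):
    """Return (jitter in microseconds, relative fluctuation as +/- percent of the mean)."""
    jitter = max_us - min_us
    relative_pct = jitter / (2.0 * mean_us) * 100.0
    return round(jitter, 1), round(relative_pct, 1)

# Min/max from the stability table, mean from the 12-hour performance statistics.
print("Read: ", jitter_stats(1749.6, 2008.0, 1785.1))
print("Write:", jitter_stats(1461.2, 1554.6, 1489.0))
```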

Performance Trend by Time Period (No Performance Degradation)

Time Period	GPU Read Average	GPU Write Average	Power Consumption
0 ~ 4h	1782 μs	1483 μs	18.38 W
4 ~ 8h	1784 μs	1490 μs	18.65 W
8 ~ 12h	1786 μs	1488 μs	18.78 W
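
Drift over the run can be quantified by comparing the last 4-hour window with the first. A minimal sketch using the values in the table above (the function name is illustrative):

```python
def percent_change(first: float, last: float) -> float:
    """Change from the first window to the last window, in percent of the first."""
    return (last - first) / first * 100.0

read_drift = percent_change(1782, 1786)      # GPU Read average, microseconds
write_drift = percent_change(1483, 1488)     # GPU Write average, microseconds
power_drift = percent_change(18.38, 18.78)   # power consumption, watts

print(f"Read:  {read_drift:+.2f}%")
print(f"Write: {write_drift:+.2f}%")
print(f"Power: {power_drift:+.2f}%")
```

Latency moves by a fraction of a percent between the first and last windows, while power rises by about 2%.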

Power Consumption Statistics

Indicator	Value
Average Power Consumption	18.57 W
Minimum Power Consumption	18.23 W
Maximum Power Consumption	18.91 W
Fluctuation Range	0.68 W (±1.8%)

Address Consistency Conclusion: GPU Direct transfer latency shows no significant variation across the full 2GB FPGA address range (0MB ~ 1984MB in 64MB steps; standard deviation < 15μs), so performance is uniform across all regions of the FPGA DDR.

5 Conclusions

5.1 Performance Conclusions

Compared with the traditional FPGA→CPU→GPU full path, GPU Direct is 1.97 times faster for Read and 2.49 times faster for Write; the core benefit comes from eliminating the intermediate cudaMemcpy copy;
Compared with the CPU-only DMA scheme, GPU Direct is still about 1.5 times faster, which reflects the advantage of the zero-copy architecture;
The measured peak bandwidth is 2.84GB/s, roughly 70% of the ~4GB/s theoretical peak of PCIe Gen3 x4.

5.2 Stability Conclusions

No performance degradation was observed during the 12-hour test: average latency in the last 4 hours differs from the first 4 hours by about 0.2% for Read and 0.3% for Write;
No memory leaks were observed, and the Pin/Unpin resource release logic behaves correctly;
Power consumption is stable with no overheating or frequency throttling, and there are no performance hotspots across the FPGA DDR address range.

5.3 Compatibility Conclusions

GPU Direct is fully backward compatible with existing upper-layer application code, which requires no modification;
Integration cost is low: only two new lines of configuration code are needed to switch to GPU Direct transfer mode;
Multi-frame image buffer rotation costs about 2% in average throughput, which suits machine vision stream processing scenarios.

5.4 Recommended Scenarios

Scenario	Recommended Scheme	Expected Speedup
FPGA Image Acquisition → GPU Inference	GPU Direct (Multi-buffer)	2.0x
FPGA Signal Processing → GPU Computing	GPU Direct (Single-buffer)	1.5x~2.0x
FPGA ↔ CPU Pure Memory Interaction (No GPU)	Traditional CPU DMA	No need to switch
Small Data Packets <4KB	Traditional CPU DMA	Insignificant improvement

5.5 Recommendations and Limitations

Recommended transfer block size ≥64KB: for small packets, DMA setup overhead makes up a large share of the transfer time, so the gain from GPU Direct is limited;
Execute the Pin operation only once during initialization; do not call Pin/Unpin repeatedly inside the transfer loop;
The GPU buffer address and the transfer length must both be 4KB aligned;
The GPU Direct ioctl operations require the program to run with root privileges.
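
The 4KB alignment requirement can be checked before a transfer is issued. The helpers below are a hypothetical sketch, not part of libHelloFPGACore.so; they show the check and how a length can be rounded up to the next 4KB boundary.

```python
ALIGNMENT = 4096  # 4KB, as required for the GPU buffer address and the transfer length

def is_aligned(value: int, alignment: int = ALIGNMENT) -> bool:
    return value % alignment == 0

def align_up(value: int, alignment: int = ALIGNMENT) -> int:
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment

def check_transfer(gpu_addr: int, length: int) -> None:
    """Raise ValueError if the address or length violates the 4KB rule."""
    if not is_aligned(gpu_addr):
        raise ValueError(f"GPU buffer address {gpu_addr:#x} is not 4KB aligned")
    if length <= 0 or not is_aligned(length):
        raise ValueError(f"transfer length {length} is not a positive multiple of 4KB")

check_transfer(0x10000, align_up(5000))   # 5000 bytes padded to 8192
print(align_up(5000))
```

Padding the length means the buffer must be allocated with the rounded-up size, so the extra bytes stay inside the pinned region.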
