
Test Performance of GPU Direct DMA RDMA and FPGA Communication on Jetson Platform

Created on:2026-07-05 15:34

1 Technical Principles

1.1 What is GPU Direct RDMA

GPU Direct RDMA (Remote Direct Memory Access) is a high-performance data transfer technology provided by NVIDIA that allows third-party PCIe devices (such as FPGAs, network cards, video capture cards) to bypass the CPU and system memory, and directly exchange data with GPU video memory through the PCIe bus.

Traditional data path (FPGA → GPU):

FPGA ──DMA──→ CPU Memory ──cudaMemcpy──→ GPU Video Memory
       ↑                        ↑
PCIe Transfer (1st time)   Memory Bus Copy (2nd time)

GPU Direct RDMA path (FPGA → GPU):

FPGA ──PCIe DMA──→ GPU Video Memory
         ↑
    Only one PCIe transfer, zero CPU copy

GPU Direct RDMA Architecture Diagram

1.2 Core Advantages

Advantage | Description
Reduced Latency | Eliminates CPU intermediate copy, reducing end-to-end latency by approximately 50%
Bandwidth Improvement | Avoids memory bus contention, increasing effective bandwidth by 1.5x~2.5x
CPU Offloading | CPU is completely idle during DMA transfer and can handle other tasks
Zero Copy | Data is directly transmitted to GPU without system memory transit buffering
Deterministic Latency | No CPU scheduling interference, suitable for real-time systems

1.3 Adapted Devices and Platforms

Platform | Support Status | GPU Type | Remarks
NVIDIA Jetson Orin (Tegra) | ✅ Verified | Integrated GPU (Unified Memory) | Via nvidia-p2p kernel interface
NVIDIA Jetson Xavier | ✅ Adaptable | Integrated GPU | Same architecture as Orin
x86 + NVIDIA Discrete GPU | ✅ Adaptable | Tesla/Quadro | Requires nvidia-peermem module
FPGA (Xilinx Kintex/Artix) | ✅ Verified | N/A | Via XDMA IP core + custom driver
Other PCIe DMA Devices | ✅ Extensible | - | Need to implement Pin/Unpin/Transfer ioctl

1.4 Workflow

  1. GPU buffer allocation (cudaHostAlloc / cudaMalloc)
  2. Pin operation: Map GPU virtual address to physical pages and lock in memory
  3. DMA transfer: FPGA directly reads/writes GPU physical pages (via PCIe BAR)
  4. Unpin operation: Release page lock

Key point: The Pin operation only needs to be executed once. The same Handle can then be reused indefinitely for subsequent DMA transfers, which avoids the per-transfer address translation overhead of the traditional method.
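
The pin-once, transfer-many pattern can be sketched as follows. This is a minimal Python simulation, not the HelloFPGA API: `MockGpuDirect`, `pin`, `transfer`, `unpin`, and `acquire_frames` are hypothetical names standing in for the Pin/Transfer/Unpin operations, and no hardware is touched.

```python
class MockGpuDirect:
    """Hypothetical stand-in for a Pin/Transfer/Unpin driver interface."""

    def __init__(self):
        self.pin_calls = 0
        self.transfers = 0
        self._pinned = {}        # handle -> (address, length)
        self._next_handle = 1

    def pin(self, address, length):
        # Expensive step (simulated): translate and lock the buffer pages.
        self.pin_calls += 1
        handle = self._next_handle
        self._next_handle += 1
        self._pinned[handle] = (address, length)
        return handle

    def transfer(self, handle, fpga_offset, length):
        # Cheap step (simulated): reuse the mapping stored behind the handle.
        if handle not in self._pinned:
            raise ValueError("buffer is not pinned")
        _, pinned_length = self._pinned[handle]
        if length > pinned_length:
            raise ValueError("transfer exceeds the pinned region")
        self.transfers += 1

    def unpin(self, handle):
        # Release the page lock held for this handle.
        del self._pinned[handle]


def acquire_frames(dev, frame_count, frame_bytes):
    # Pin once during initialization (the address is an arbitrary example).
    handle = dev.pin(address=0x1000_0000, length=frame_bytes)
    try:
        for _ in range(frame_count):
            dev.transfer(handle, fpga_offset=0, length=frame_bytes)
    finally:
        dev.unpin(handle)  # unpin once at teardown, even on error


dev = MockGpuDirect()
acquire_frames(dev, frame_count=100, frame_bytes=4 * 1024 * 1024)
print(dev.pin_calls, dev.transfers)  # 1 100
```

Keeping the Pin and Unpin calls outside the loop is the point of the sketch: the loop body only pays for the transfer itself.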

2 Test Environment

2.1 Hardware Platform

Xingce Electronics JetKU hardware: the FPGA connects to the Jetson NX over PCIe Gen3 x4, has 2GB of onboard DDR4 memory, and uses the Xilinx XDMA IP.

Component | Specification
Embedded Platform | NVIDIA Jetson Orin (aarch64)
GPU | Orin Integrated Ampere GPU, Unified Memory Architecture
FPGA | Xilinx Series, PCIe Gen3 x4
FPGA-side Memory | DDR4 2GB
PCIe Link | Gen3 x4 (Theoretical peak ~4 GB/s)
System Memory | LPDDR5 (Unified Memory)
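
The ~4 GB/s theoretical peak follows from the PCIe Gen3 signalling rate (8 GT/s per lane) and its 128b/130b line encoding. The short calculation below is an illustrative sketch; the function name is made up here, and the result excludes TLP/DLLP protocol overhead, so achievable payload bandwidth is lower.

```python
def pcie_peak_gb_per_s(gt_per_s, lanes, encoding_efficiency):
    """Raw link bandwidth in GB/s (10^9 bytes/s), before protocol overhead."""
    bits_per_s = gt_per_s * 1e9 * lanes * encoding_efficiency
    return bits_per_s / 8 / 1e9


# PCIe Gen3: 8 GT/s per lane, 128b/130b encoding, 4 lanes.
gen3_x4 = pcie_peak_gb_per_s(gt_per_s=8.0, lanes=4, encoding_efficiency=128 / 130)
print(f"{gen3_x4:.2f} GB/s")  # 3.94 GB/s
```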

Hardware Connection Diagram

2.2 Software Environment

Component | Version
OS | Ubuntu 22.04 (aarch64)
CUDA | 12.6
Kernel Driver | HelloFPGA XDMA Custom Driver (v2020.2.2) + GPU Direct Extension
User-space Library | libHelloFPGACore.so (including GPU Direct compatibility layer)
Compiler | nvcc (CUDA 12.6) + GCC

2.3 Driver Architecture

User Space:  HelloFPGACore.so (TransferMode API)
               │
               ├─ CPU Mode: open(/dev/HelloFPGA0_c2h_*) → read/write
               │
               └─ GPU Direct Mode: open → ioctl(XDMA_IOC_GPU_PIN/XFER/UNPIN)
                        │
Kernel Space:  HelloFPGA.ko (XDMA + xdma_gpu_direct + xdma_gpu_tegra)
                        │
Hardware:      FPGA XDMA IP ←──PCIe──→ GPU BAR (Physical Address Direct Access)

3 Test Methods

3.1 Test Tools

Test program: gpu_direct_api_test.cu

Supports two running modes:

  • Quick functional test: Verify API correctness + performance comparison (about 2 minutes)
  • Long-term steady-state stress test: Continuous operation for 12 hours, recording data every 10 minutes (--long parameter)

3.2 Test Comparison Scheme

Path ID | Scheme Name | Data Flow | Description
[A] | CPU DMA Only | FPGA → CPU Memory | Traditional DMA, data stays on CPU side
[B] | FPGA→CPU→GPU Full Path | FPGA → CPU → GPU | Complete path for delivering data to GPU in traditional way
[C] | GPU Direct Handle | FPGA → GPU Direct | Pre-Pin + DMA direct transfer, no CPU transit

3.3 Test Items

Test Item | Content
API Function Verification | GetStatus / Pin / ReadC2H / WriteH2C / Unpin
Data Correctness | Write pattern → Read back → Byte-by-byte comparison
Multi-size Performance | Full coverage of 6 sizes from 4KB to 8MB
2GB Address Space | Traverse the entire 0~2GB range of FPGA to verify no address dead spots
Multi-buffer Rotation | 4-frame GPU buffer cyclic acquisition, compared with single buffer
TransferMode Compatibility | Zero modification to old interfaces, internal automatic routing to GPU Direct
12-hour Stability | 72 samples, full recording of power consumption / performance / jitter

3.4 Key Parameters

  • Transfer block size: 4MB (long test) / 4KB~8MB (quick test)
  • FPGA address step: 64MB (traverse 0~2GB, 32 test points / round)
  • Sampling frequency: every 10 minutes (long test)
  • Iterations per sampling point: average of 5 times
  • Power consumption collection: INA3221 sensor (VDD_IN channel)
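
These parameters determine the number of address test points and samples. The sketch below reproduces those counts; it assumes (this is not stated explicitly in the test description) that each 10-minute sample advances to the next 64MB test point, which is consistent with the two full traversal rounds reported for the 12-hour run. The helper name is illustrative.

```python
MB = 1024 * 1024


def traversal_offsets(total_bytes, step_bytes):
    """FPGA DDR offsets visited in one traversal round."""
    return list(range(0, total_bytes, step_bytes))


offsets = traversal_offsets(2048 * MB, 64 * MB)
samples = (12 * 60) // 10               # one sample every 10 minutes for 12 hours
full_rounds = samples // len(offsets)   # assumption: one test point per sample

print(len(offsets), offsets[-1] // MB, samples, full_rounds)  # 32 1984 72 2
```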

4 Test Results

4.1 Multi-size Performance Comparison (Quick Test)

Latency comparison (μs, lower is better), table format: Read / Write

Data Size | CPU DMA Only | FPGA→CPU→GPU Full Path | GPU Direct | Speedup Ratio (vs CPU) | Speedup Ratio (vs Full Path)
4KB | 82.6 / 58.4 | 232.8 / 86.2 | 54.4 / 45.4 | 1.52x / 1.29x | 4.28x / 1.90x
64KB | 87.8 / 80.4 | 122.4 / 195.2 | 70.2 / 74.4 | 1.25x / 1.08x | 1.74x / 2.62x
512KB | 341.8 / 301.6 | 635.2 / 740.2 | 292.0 / 243.4 | 1.17x / 1.24x | 2.18x / 3.04x
1MB | 638.4 / 574.6 | 1016.0 / 1219.2 | 521.2 / 412.6 | 1.22x / 1.39x | 1.95x / 2.95x
4MB | 2398.0 / 2102.8 | 3417.6 / 3522.2 | 1773.6 / 1487.0 | 1.35x / 1.41x | 1.93x / 2.37x
8MB | 4888.2 / 4250.0 | 6564.8 / 6507.6 | 3557.4 / 2951.2 | 1.37x / 1.44x | 1.85x / 2.21x
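
The speedup ratios are the baseline latency divided by the GPU Direct latency. As a quick check, the sketch below recomputes the 4KB row from the latencies in the table; the variable and function names are illustrative.

```python
def speedup(baseline_us, candidate_us):
    """How many times faster the candidate is than the baseline."""
    return baseline_us / candidate_us


# 4KB row, (Read, Write) latencies in microseconds, copied from the table.
cpu_dma = (82.6, 58.4)
full_path = (232.8, 86.2)
gpu_direct = (54.4, 45.4)

vs_cpu = [round(speedup(b, g), 2) for b, g in zip(cpu_dma, gpu_direct)]
vs_full = [round(speedup(b, g), 2) for b, g in zip(full_path, gpu_direct)]
print(vs_cpu, vs_full)  # [1.52, 1.29] [4.28, 1.9]
```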

Bandwidth comparison (GB/s, higher is better), table format: Read / Write

Data Size | CPU DMA Only | FPGA→CPU→GPU | GPU Direct
512KB | 1.53 / 1.74 | 0.83 / 0.71 | 1.80 / 2.15
1MB | 1.64 / 1.82 | 1.03 / 0.86 | 2.01 / 2.54
4MB | 1.75 / 1.99 | 1.23 / 1.19 | 2.36 / 2.82
8MB | 1.72 / 1.97 | 1.28 / 1.29 | 2.36 / 2.84

GPU Direct peak bandwidth reaches 2.84 GB/s (8MB write), roughly 70% of the ~4 GB/s theoretical peak of PCIe Gen3 x4.
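
The bandwidth figures can be derived from the latency table as transfer size divided by latency. The sketch below assumes binary sizes (1MB = 1024 × 1024 bytes) and decimal GB/s (10^9 bytes/s); that convention is inferred from the fact that it reproduces the published numbers, not stated in the test report.

```python
MIB = 1024 * 1024


def bandwidth_gb_per_s(size_bytes, latency_us):
    """Throughput in GB/s (10^9 bytes/s) for one transfer."""
    return size_bytes / (latency_us * 1e-6) / 1e9


# GPU Direct, 8MB row of the latency table: Read 3557.4 us, Write 2951.2 us.
read_8mb = bandwidth_gb_per_s(8 * MIB, 3557.4)
write_8mb = bandwidth_gb_per_s(8 * MIB, 2951.2)
print(f"{read_8mb:.2f} / {write_8mb:.2f}")  # 2.36 / 2.84
```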

4.2 TransferMode Compatibility Mode Performance

User code is unchanged (it still calls HelloFPGA_DMA_MM_*); the mode is switched with only 2 lines of configuration. Unit: μs; format: Read / Write

Data Size | CPU Mode | GPU_PINNED Mode | Speedup Ratio
4KB | 79.6 / 65.4 | 57.8 / 51.8 | 1.38x / 1.26x
64KB | 140.4 / 123.2 | 69.8 / 75.0 | 2.01x / 1.64x
256KB | 198.2 / 210.6 | 141.4 / 155.2 | 1.40x / 1.36x
1MB | 607.0 / 677.8 | 417.2 / 487.6 | 1.45x / 1.39x
4MB | 2183.2 / 2516.2 | 1491.8 / 1781.4 | 1.46x / 1.41x
8MB | 4279.4 / 4977.4 | 2954.4 / 3541.0 | 1.45x / 1.41x

4.3 Multi-buffer vs Single-buffer Comparison

This test simulates an actual image acquisition scenario: rotation across 4 GPU buffers vs repeated read/write of a single buffer (100 iterations, 1MB/frame).

Indicator | Multi-buffer (4-frame rotation) | Single-buffer | Difference
Average Latency | 501.7 μs | 491.9 μs | -
Minimum Latency | 460.0 μs | 455.0 μs | -
Maximum Latency | 1124.0 μs | 660.0 μs | -
Average Bandwidth | 2.09 GB/s | 2.13 GB/s | -
Equivalent Frame Rate | 1993 fps | 2033 fps | 0.98x

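Conclusion: The overhead of multi-buffer address lookup and matching is negligible on average (about 2% higher average latency), although the worst-case latency was higher with multi-buffer rotation (1124.0 μs vs 660.0 μs).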
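
The equivalent frame rate is the reciprocal of the average per-frame latency. The sketch below recomputes it from the table and also shows a plain round-robin index over 4 buffers; the modulo scheme is an assumption about how the rotation might be implemented, not the test program's actual code.

```python
def equivalent_fps(avg_latency_us):
    """Frames per second if every frame takes avg_latency_us microseconds."""
    return 1e6 / avg_latency_us


multi = equivalent_fps(501.7)    # 4-frame rotation
single = equivalent_fps(491.9)   # single buffer
print(round(multi), round(single), round(multi / single, 2))  # 1993 2033 0.98

# Hypothetical round-robin buffer selection for consecutive frames.
NUM_BUFFERS = 4
order = [frame % NUM_BUFFERS for frame in range(6)]
print(order)  # [0, 1, 2, 3, 0, 1]
```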

4.4 12-hour Long-term Steady-state Test

Test duration: 12.00 hours; sampling points: 72 (one every 10 minutes); transfer block size: 4MB; address traversal: 0 ~ 2GB (2 full rounds completed)

Performance Statistics

Indicator | CPU DMA Only | FPGA→CPU→GPU | GPU Direct
Average Read (μs) | 2711.9 | 3518.3 | 1785.1
Average Write (μs) | 2276.2 | 3701.2 | 1489.0
Average Bandwidth Read | 1.55 GB/s | 1.19 GB/s | 2.35 GB/s
Average Bandwidth Write | 1.84 GB/s | 1.13 GB/s | 2.82 GB/s

Speedup Ratio of GPU Direct (by Baseline)

Comparison Baseline | Read | Write
GPU Direct vs CPU DMA | 1.52x | 1.53x
GPU Direct vs Traditional Full Path | 1.97x | 2.49x

GPU Direct Performance Stability

Indicator | GPU Direct Read | GPU Direct Write
Minimum Latency | 1749.6 μs | 1461.2 μs
Maximum Latency | 2008.0 μs | 1554.6 μs
Jitter (max-min) | 258.4 μs | 93.4 μs
Relative Fluctuation | ±7.2% | ±3.1%
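
The jitter and relative fluctuation rows can be recomputed from the minimum, maximum, and average latencies. The sketch assumes that relative fluctuation means half the peak-to-peak jitter divided by the average latency from the Performance Statistics table; that definition is inferred because it reproduces the reported values.

```python
def jitter_stats(min_us, max_us, mean_us):
    """Return (peak-to-peak jitter in us, +/- relative fluctuation in percent)."""
    jitter = max_us - min_us
    relative = jitter / 2 / mean_us * 100
    return round(jitter, 1), round(relative, 1)


read_stats = jitter_stats(1749.6, 2008.0, 1785.1)
write_stats = jitter_stats(1461.2, 1554.6, 1489.0)
print(read_stats, write_stats)  # (258.4, 7.2) (93.4, 3.1)
```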

Performance Trend by Time Period (No Performance Degradation)

Time Period | GPU Read Average | GPU Write Average | Power Consumption
0 ~ 4h | 1782 μs | 1483 μs | 18.38 W
4 ~ 8h | 1784 μs | 1490 μs | 18.65 W
8 ~ 12h | 1786 μs | 1488 μs | 18.78 W
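
The drift between the first and last 4-hour periods can be quantified as a relative change. This is a small illustrative calculation on the averages in the table; the function name is made up.

```python
def drift_percent(first_us, last_us):
    """Relative change from the first period to the last, in percent."""
    return (last_us - first_us) / first_us * 100


read_drift = drift_percent(1782, 1786)
write_drift = drift_percent(1483, 1488)
print(f"{read_drift:.2f}% {write_drift:.2f}%")  # 0.22% 0.34%
```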

Power Consumption Statistics

Indicator | Value
Average Power Consumption | 18.57 W
Minimum Power Consumption | 18.23 W
Maximum Power Consumption | 18.91 W
Fluctuation Range | 0.69 W (±1.8%)

Address Consistency Conclusion: GPU Direct transfer latency shows no significant difference across the FPGA's full 2GB address range (0MB ~ 1984MB, 64MB step; standard deviation < 15μs), so performance is balanced across all regions of the FPGA DDR.

5 Conclusions

5.1 Performance Conclusions

  • Compared with the traditional FPGA→CPU→GPU full path, GPU Direct accelerates Read by 1.97 times and Write by 2.49 times; the core benefit comes from eliminating the cudaMemcpy intermediate copy;
  • It is still about 1.5 times faster than the CPU-only DMA scheme, reflecting the advantage of the zero-copy architecture;
  • The measured peak bandwidth is 2.84GB/s, roughly 70% of the ~4GB/s theoretical peak of PCIe Gen3 x4.

5.2 Stability Conclusions

  • No performance degradation throughout the 12-hour long test: average latency in the first and last 4-hour periods differs by about 0.2% for Read and 0.3% for Write;
  • No memory leaks, and the Pin/Unpin resource release logic is normal;
  • Power consumption is stable, with no overheating or frequency throttling, and there are no performance hotspots at any FPGA DDR address.

5.3 Compatibility Conclusions

  • Fully backward compatible with existing upper-layer business code; no modification is required;
  • Very low integration cost: only two new lines of configuration code are needed to switch to GPU Direct transfer mode;
  • Multi-frame image buffer rotation has almost no average performance loss, which suits machine vision stream processing scenarios.

5.4 Recommended Scenarios

Scenario | Recommended Scheme | Expected Speedup
FPGA Image Acquisition → GPU Inference | GPU Direct (Multi-buffer) | 2.0x
FPGA Signal Processing → GPU Computing | GPU Direct (Single-buffer) | 1.5x~2.0x
FPGA ↔ CPU Pure Memory Interaction (No GPU) | Traditional CPU DMA | No need to switch
Small Data Packets <4KB | Traditional CPU DMA | Insignificant improvement

5.5 Recommendations and Limitations

  • Optimal transfer block ≥64KB: for small packets, DMA setup overhead accounts for a large share of the transfer time, so the gain from GPU Direct is limited;
  • Execute the Pin operation only once, during initialization; do not Pin/Unpin repeatedly inside the transfer loop;
  • The GPU buffer address and transfer length must be 4K-byte aligned;
  • The GPU Direct ioctl operations require the program to run with root privileges.
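
The 4K-byte alignment rule can be checked before a transfer is issued. The helpers below are a hypothetical sketch (none of these names come from the HelloFPGA library); they show one way to validate an address and length and to pad a frame size up to the next 4096-byte boundary.

```python
PAGE_SIZE = 4096  # 4K-byte alignment unit from the recommendations above


def is_aligned(value, alignment=PAGE_SIZE):
    return value % alignment == 0


def align_up(value, alignment=PAGE_SIZE):
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


def validate_transfer(gpu_address, length):
    """Raise ValueError if the buffer address or length breaks the 4K rule."""
    if not is_aligned(gpu_address):
        raise ValueError(f"address {gpu_address:#x} is not 4K-aligned")
    if length <= 0 or not is_aligned(length):
        raise ValueError(f"length {length} is not a positive multiple of 4096")


frame_bytes = 1920 * 1080 * 3            # example frame size, not from the test
padded = align_up(frame_bytes)
validate_transfer(0x2000_0000, padded)   # passes: both values are aligned
print(frame_bytes, padded)               # 6220800 6221824
```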