Performance Test of GPU Direct RDMA Communication Between an FPGA and the GPU on the Jetson Platform
1 Technical Principles
1.1 What is GPU Direct RDMA
GPU Direct RDMA (Remote Direct Memory Access) is a high-performance data transfer technology provided by NVIDIA that allows third-party PCIe devices (such as FPGAs, network cards, video capture cards) to bypass the CPU and system memory, and directly exchange data with GPU video memory through the PCIe bus.
Traditional data path (FPGA → GPU), two copies:
FPGA ──DMA (PCIe transfer, 1st copy)──→ CPU Memory ──cudaMemcpy (memory bus copy, 2nd copy)──→ GPU Video Memory
GPU Direct RDMA path (FPGA → GPU), one copy:
FPGA ──PCIe DMA──→ GPU Video Memory (only one PCIe transfer, zero CPU copy)
1.2 Core Advantages
| Advantage | Description |
|---|---|
| Reduced Latency | Eliminates CPU intermediate copy, reducing end-to-end latency by approximately 50% |
| Bandwidth Improvement | Avoids memory bus contention, increasing effective bandwidth by 1.5x~2.5x |
| CPU Offloading | The CPU does not touch the payload during a DMA transfer and is free to handle other tasks |
| Zero Copy | Data lands directly in the GPU buffer, with no intermediate staging buffer in system memory |
| Deterministic Latency | No CPU scheduling interference, suitable for real-time systems |
1.3 Adapted Devices and Platforms
| Platform | Support Status | GPU Type | Remarks |
|---|---|---|---|
| NVIDIA Jetson Orin (Tegra) | ✅ Verified | Integrated GPU (Unified Memory) | Via nvidia-p2p kernel interface |
| NVIDIA Jetson Xavier | ✅ Adaptable | Integrated GPU | Same architecture as Orin |
| x86 + NVIDIA Discrete GPU | ✅ Adaptable | Tesla/Quadro | Requires nvidia-peermem module |
| FPGA (Xilinx Kintex/Artix) | ✅ Verified | N/A | Via XDMA IP core + custom driver |
| Other PCIe DMA Devices | ✅ Extensible | - | Need to implement Pin/Unpin/Transfer ioctl |
1.4 Workflow
- GPU buffer allocation (cudaHostAlloc / cudaMalloc)
- Pin operation: Map GPU virtual address to physical pages and lock in memory
- DMA transfer: FPGA directly reads/writes GPU physical pages (via PCIe BAR)
- Unpin operation: Release page lock
Key point: The Pin operation only needs to be executed once. The resulting Handle can then be reused indefinitely for subsequent DMA transfers, which avoids the per-transfer address translation overhead of the traditional method.
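The pin-once, transfer-many pattern described above can be sketched as a small RAII wrapper. This is a minimal illustration and not the actual HelloFPGA API: the `DmaBackend` interface, the `PinnedBuffer` class and the `CountingBackend` fake are hypothetical names introduced here, and the real driver exposes Pin / Transfer / Unpin as ioctls whose argument layout is not shown in this report.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Hypothetical interface standing in for the driver's Pin / Transfer / Unpin
// operations. Names and signatures are illustrative assumptions only.
struct DmaBackend {
    virtual ~DmaBackend() = default;
    virtual std::uint64_t pin(void* gpu_addr, std::size_t len) = 0;
    virtual void transfer(std::uint64_t handle, std::uint64_t fpga_addr,
                          std::size_t len) = 0;
    virtual void unpin(std::uint64_t handle) = 0;
};

// Pins once on construction, unpins once on destruction, and reuses the
// same handle for every transfer in between.
class PinnedBuffer {
public:
    PinnedBuffer(DmaBackend& backend, void* gpu_addr, std::size_t len)
        : backend_(backend), len_(len), handle_(backend.pin(gpu_addr, len)) {}
    ~PinnedBuffer() { backend_.unpin(handle_); }
    PinnedBuffer(const PinnedBuffer&) = delete;
    PinnedBuffer& operator=(const PinnedBuffer&) = delete;

    void transfer(std::uint64_t fpga_addr, std::size_t len) {
        if (len > len_) throw std::out_of_range("transfer exceeds pinned length");
        backend_.transfer(handle_, fpga_addr, len);
    }

private:
    DmaBackend& backend_;
    std::size_t len_;
    std::uint64_t handle_;
};

// Fake backend that only counts calls, so the call pattern can be checked
// without any hardware.
struct CountingBackend : DmaBackend {
    int pins = 0, transfers = 0, unpins = 0;
    std::uint64_t pin(void*, std::size_t) override {
        return static_cast<std::uint64_t>(++pins);
    }
    void transfer(std::uint64_t, std::uint64_t, std::size_t) override { ++transfers; }
    void unpin(std::uint64_t) override { ++unpins; }
};
```

With the counting fake, constructing one `PinnedBuffer` and calling `transfer` 100 times results in exactly one pin, 100 transfers and one unpin, which is the call pattern the workflow asks for.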
2 Test Environment
2.1 Hardware Platform
| Component | Specification |
|---|---|
| Embedded Platform | NVIDIA Jetson Orin (aarch64) |
| GPU | Orin Integrated Ampere GPU, Unified Memory Architecture |
| FPGA | Xilinx Series, PCIe Gen3 x4 |
| FPGA-side Memory | DDR4 2GB |
| PCIe Link | Gen3 x4 (Theoretical peak ~4GB/s) |
| System Memory | LPDDR5 (Unified Memory) |
2.2 Software Environment
| Component | Version |
|---|---|
| OS | Ubuntu 22.04 (aarch64) |
| CUDA | 12.6 |
| Kernel Driver | HelloFPGA XDMA Custom Driver (v2020.2.2) + GPU Direct Extension |
| User-space Library | libHelloFPGACore.so (including GPU Direct compatibility layer) |
| Compiler | nvcc (CUDA 12.6) + GCC |
2.3 Driver Architecture
User Space: HelloFPGACore.so (TransferMode API)
│
├─ CPU Mode: open(/dev/HelloFPGA0_c2h_*) → read/write
│
└─ GPU Direct Mode: open → ioctl(XDMA_IOC_GPU_PIN/XFER/UNPIN)
│
Kernel Space: HelloFPGA.ko (XDMA + xdma_gpu_direct + xdma_gpu_tegra)
│
Hardware: FPGA XDMA IP ←──PCIe──→ GPU BAR (Physical Address Direct Access)
3 Test Methods
3.1 Test Tools
Test program: gpu_direct_api_test.cu
Supports two running modes:
- Quick functional test: Verify API correctness + performance comparison (about 2 minutes)
- Long-term steady-state stress test: Continuous operation for 12 hours, recording data every 10 minutes (--long parameter)
3.2 Test Comparison Scheme
| Path ID | Scheme Name | Data Flow | Description |
|---|---|---|---|
| [A] | CPU DMA Only | FPGA → CPU Memory | Traditional DMA, data stays on CPU side |
| [B] | FPGA→CPU→GPU Full Path | FPGA → CPU → GPU | Complete path for delivering data to GPU in traditional way |
| [C] | GPU Direct Handle | FPGA → GPU Direct | Pre-Pin + DMA direct transfer, no CPU transit |
3.3 Test Items
| Test Item | Content |
|---|---|
| API Function Verification | GetStatus / Pin / ReadC2H / WriteH2C / Unpin |
| Data Correctness | Write pattern → Read back → Byte-by-byte comparison |
| Multi-size Performance | Full coverage of 6 sizes from 4KB to 8MB |
| 2GB Address Space | Traverse the entire 0~2GB range of FPGA to verify no address dead spots |
| Multi-buffer Rotation | 4-frame GPU buffer cyclic acquisition, compared with single buffer |
| TransferMode Compatibility | Zero modification to old interfaces, internal automatic routing to GPU Direct |
| 12-hour Stability | 72 samples, full recording of power consumption / performance / jitter |
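The "Data Correctness" item (write pattern, read back, byte-by-byte comparison) can be sketched in host code as follows. The pattern formula and both function names are assumptions of this example; the report does not state which pattern gpu_direct_api_test.cu actually uses.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Fills a buffer with a position-dependent byte pattern, so that many
// shifted or repeated blocks are caught as well as corrupted bytes.
std::vector<std::uint8_t> make_pattern(std::size_t len, std::uint8_t seed) {
    std::vector<std::uint8_t> buf(len);
    for (std::size_t i = 0; i < len; ++i)
        buf[i] = static_cast<std::uint8_t>(seed ^ i ^ (i >> 8) ^ (i >> 16));
    return buf;
}

// Byte-by-byte comparison. Returns the index of the first mismatch,
// or -1 if both buffers are identical in length and content.
std::ptrdiff_t first_mismatch(const std::vector<std::uint8_t>& expected,
                              const std::vector<std::uint8_t>& actual) {
    const std::size_t common = std::min(expected.size(), actual.size());
    for (std::size_t i = 0; i < common; ++i)
        if (expected[i] != actual[i]) return static_cast<std::ptrdiff_t>(i);
    if (expected.size() != actual.size()) return static_cast<std::ptrdiff_t>(common);
    return -1;
}
```

In a real run, `expected` would be the buffer written to the FPGA (H2C) and `actual` the buffer read back (C2H); reporting the first mismatching offset makes it easier to tell an address error from a data error.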
3.4 Key Parameters
- Transfer block size: 4MB (long test) / 4KB~8MB (quick test)
- FPGA address step: 64MB (traverse 0~2GB, 32 test points / round)
- Sampling frequency: every 10 minutes (long test)
- Iterations per sampling point: average of 5 times
- Power consumption collection: INA3221 sensor (VDD_IN channel)
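The address traversal implied by these parameters can be sketched as below. The function names are illustrative, and the mapping of one FPGA address per sampling point is an inference from the figures in this report (32 points per round, 72 samples, 2 full rounds), not something the report states explicitly.

```cpp
#include <cstdint>
#include <vector>

constexpr std::uint64_t kMiB = 1024ull * 1024ull;

// Start addresses of one traversal round: 0, step, 2*step, ... below range.
std::vector<std::uint64_t> traversal_addresses(std::uint64_t range_bytes,
                                               std::uint64_t step_bytes) {
    std::vector<std::uint64_t> addrs;
    for (std::uint64_t a = 0; a + step_bytes <= range_bytes; a += step_bytes)
        addrs.push_back(a);
    return addrs;
}

// FPGA address for the n-th sample when the traversal wraps around
// (assumes one test point per sample).
std::uint64_t address_for_sample(std::uint64_t sample_index,
                                 std::uint64_t range_bytes,
                                 std::uint64_t step_bytes) {
    const std::uint64_t points = range_bytes / step_bytes;
    return (sample_index % points) * step_bytes;
}
```

With a 2GB range and a 64MB step this yields 32 addresses, the last one at 1984MB, and 72 samples cover two full rounds plus 8 extra points.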
4 Test Results
4.1 Multi-size Performance Comparison (Quick Test)
Latency comparison (μs, lower is better), table format: Read / Write
| Data Size | CPU DMA Only | FPGA→CPU→GPU Full Path | GPU Direct | Speedup Ratio (vs CPU) | Speedup Ratio (vs Full Path) |
|---|---|---|---|---|---|
| 4KB | 82.6 / 58.4 | 232.8 / 86.2 | 54.4 / 45.4 | 1.52x / 1.29x | 4.28x / 1.90x |
| 64KB | 87.8 / 80.4 | 122.4 / 195.2 | 70.2 / 74.4 | 1.25x / 1.08x | 1.74x / 2.62x |
| 512KB | 341.8 / 301.6 | 635.2 / 740.2 | 292.0 / 243.4 | 1.17x / 1.24x | 2.18x / 3.04x |
| 1MB | 638.4 / 574.6 | 1016.0 / 1219.2 | 521.2 / 412.6 | 1.22x / 1.39x | 1.95x / 2.95x |
| 4MB | 2398.0 / 2102.8 | 3417.6 / 3522.2 | 1773.6 / 1487.0 | 1.35x / 1.41x | 1.93x / 2.37x |
| 8MB | 4888.2 / 4250.0 | 6564.8 / 6507.6 | 3557.4 / 2951.2 | 1.37x / 1.44x | 1.85x / 2.21x |
Bandwidth comparison (GB/s, higher is better), table format: Read / Write
| Data Size | CPU DMA Only | FPGA→CPU→GPU | GPU Direct |
|---|---|---|---|
| 512KB | 1.53 / 1.74 | 0.83 / 0.71 | 1.80 / 2.15 |
| 1MB | 1.64 / 1.82 | 1.03 / 0.86 | 2.01 / 2.54 |
| 4MB | 1.75 / 1.99 | 1.23 / 1.19 | 2.36 / 2.82 |
| 8MB | 1.72 / 1.97 | 1.28 / 1.29 | 2.36 / 2.84 |
GPU Direct peak bandwidth reaches 2.84 GB/s (8MB write), roughly 70% of the ~4 GB/s theoretical peak of the PCIe Gen3 x4 link.
4.2 TransferMode Compatibility Mode Performance
User code is unchanged (it still calls HelloFPGA_DMA_MM_*); the mode is switched with only 2 lines of configuration. Unit: μs; format: Read / Write
| Data Size | CPU Mode | GPU_PINNED Mode | Speedup Ratio |
|---|---|---|---|
| 4KB | 79.6 / 65.4 | 57.8 / 51.8 | 1.38x / 1.26x |
| 64KB | 140.4 / 123.2 | 69.8 / 75.0 | 2.01x / 1.64x |
| 256KB | 198.2 / 210.6 | 141.4 / 155.2 | 1.40x / 1.36x |
| 1MB | 607.0 / 677.8 | 417.2 / 487.6 | 1.45x / 1.39x |
| 4MB | 2183.2 / 2516.2 | 1491.8 / 1781.4 | 1.46x / 1.41x |
| 8MB | 4279.4 / 4977.4 | 2954.4 / 3541.0 | 1.45x / 1.41x |
4.3 Multi-buffer vs Single-buffer Comparison
This test simulates a real image acquisition scenario: rotation across 4 GPU buffers versus repeated read/write of a single buffer (100 iterations, 1MB/frame)
| Indicator | Multi-buffer (4-frame rotation) | Single-buffer | Difference |
|---|---|---|---|
| Average Latency | 501.7 μs | 491.9 μs | - |
| Minimum Latency | 460.0 μs | 455.0 μs | - |
| Maximum Latency | 1124.0 μs | 660.0 μs | - |
| Average Bandwidth | 2.09 GB/s | 2.13 GB/s | - |
| Equivalent Frame Rate | 1993 fps | 2033 fps | 0.98x |
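Conclusion: The overhead of multi-buffer address lookup and matching is negligible on average (about 2% in latency and frame rate). The maximum latency is higher with rotation (1124.0 μs vs 660.0 μs), so worst-case behavior should be checked separately for hard real-time use.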
4.4 12-hour Long-term Steady-state Test
Test duration: 12.00 hours; Number of sampling points: 72 (one every 10 minutes); Transfer block size: 4MB; Address traversal: 0 ~ 2GB (2 full rounds completed)
Performance Statistics
| Indicator | CPU DMA Only | FPGA→CPU→GPU | GPU Direct |
|---|---|---|---|
| Average Read (μs) | 2711.9 | 3518.3 | 1785.1 |
| Average Write (μs) | 2276.2 | 3701.2 | 1489.0 |
| Average Bandwidth Read | 1.55 GB/s | 1.19 GB/s | 2.35 GB/s |
| Average Bandwidth Write | 1.84 GB/s | 1.13 GB/s | 2.82 GB/s |
Speedup Ratio of GPU Direct (relative to each traditional baseline)
| Comparison Baseline | Read | Write |
|---|---|---|
| GPU Direct vs CPU DMA | 1.52x | 1.53x |
| GPU Direct vs Traditional Full Path | 1.97x | 2.49x |
GPU Direct Performance Stability
| Indicator | GPU Direct Read | GPU Direct Write |
|---|---|---|
| Minimum Latency | 1749.6 μs | 1461.2 μs |
| Maximum Latency | 2008.0 μs | 1554.6 μs |
| Jitter (max-min) | 258.4 μs | 93.4 μs |
| Relative Fluctuation | ±7.2% | ±3.1% |
Performance Trend by Time Period (No Performance Degradation)
| Time Period | GPU Read Average | GPU Write Average | Power Consumption |
|---|---|---|---|
| 0 ~ 4h | 1782 μs | 1483 μs | 18.38 W |
| 4 ~ 8h | 1784 μs | 1490 μs | 18.65 W |
| 8 ~ 12h | 1786 μs | 1488 μs | 18.78 W |
Power Consumption Statistics
| Indicator | Value |
|---|---|
| Average Power Consumption | 18.57 W |
| Minimum Power Consumption | 18.23 W |
| Maximum Power Consumption | 18.91 W |
| Fluctuation Range | 0.69 W (±1.8%) |
Address Consistency Conclusion: GPU Direct transfer latency shows no significant variation across the full 2GB FPGA address range (0MB ~ 1984MB, 64MB step; standard deviation < 15μs), so performance is uniform across all regions of the FPGA DDR.
5 Conclusions
5.1 Performance Conclusions
- Compared with the traditional FPGA→CPU→GPU full path, GPU Direct accelerates Read by 1.97 times and Write by 2.49 times; the core benefit comes from eliminating the cudaMemcpy intermediate copy;
- Compared with the CPU-only DMA scheme, which does not deliver data to the GPU at all, GPU Direct is still about 1.5 times faster (1.52x Read, 1.53x Write in the 12-hour test), so the zero-copy path wins even before the extra cudaMemcpy is counted;
- The measured peak bandwidth is 2.84GB/s, roughly 70% of the ~4GB/s theoretical peak of the PCIe Gen3 x4 link.
5.2 Stability Conclusions
- No performance degradation was observed over the 12-hour test: the GPU Direct latency difference between the first and last 4-hour periods is below 0.4% (Read 1782 → 1786 μs, Write 1483 → 1488 μs);
- No memory leaks were observed, and Pin/Unpin resources were released correctly;
- Power consumption remained stable (18.23 W to 18.91 W) with no overheating or frequency throttling, and no performance hotspots were found across the FPGA DDR address range.
5.3 Compatibility Conclusions
- Fully backward compatible with existing upper-layer application code; no modification is required;
- Adoption cost is low: only two new lines of configuration code are needed to switch to GPU Direct transfer mode;
- Multi-frame image buffer rotation costs about 2% in average throughput, which suits machine vision stream processing scenarios.
5.4 Recommended Scenarios
| Scenario | Recommended Scheme | Expected Speedup |
|---|---|---|
| FPGA Image Acquisition → GPU Inference | GPU Direct (Multi-buffer) | 2.0x |
| FPGA Signal Processing → GPU Computing | GPU Direct (Single-buffer) | 1.5x~2.0x |
| FPGA ↔ CPU Pure Memory Interaction (No GPU) | Traditional CPU DMA | No need to switch |
| Small Data Packets <4KB | Traditional CPU DMA | Insignificant improvement |
5.5 Recommendations and Limitations
- Preferred transfer block size ≥64KB: for small packets, the fixed DMA setup overhead dominates the transfer time, so the absolute gain from GPU Direct is limited;
- Execute the Pin operation only once, during initialization; do not call Pin/Unpin repeatedly inside the transfer loop;
- The GPU buffer address and the transfer length must both be 4KB-aligned;
- The GPU Direct ioctl operations require the program to run with root privileges.
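The 4KB alignment requirement can be checked in user code before a buffer is handed to the driver. The helpers below are a minimal sketch with illustrative names; how the driver itself reports a misaligned request is not covered in this report.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kDmaAlign = 4096;  // 4KB, per the requirement above

// True if both the buffer address and the transfer length are 4KB-aligned.
bool is_4k_aligned(const void* addr, std::size_t len) {
    return reinterpret_cast<std::uintptr_t>(addr) % kDmaAlign == 0 &&
           len % kDmaAlign == 0;
}

// Rounds a length up to the next multiple of 4KB, e.g. for sizing a buffer.
std::size_t round_up_4k(std::size_t len) {
    return (len + kDmaAlign - 1) / kDmaAlign * kDmaAlign;
}
```

A typical use is to size the allocation with `round_up_4k(frame_bytes)` and to verify `is_4k_aligned(ptr, len)` once, just before the one-time Pin call.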

