Device Memory Sharing Study in Broadcast Systems

Jonas Ohland, Michael Lefebvre, Thomas J. True, Gareth Sylvester-Bradley, Sithideth Viengkhou

As broadcast systems continue their transition to standard commercial off-the-shelf (COTS) computing platforms, the growing reliance on Graphics Processing Unit (GPU)-accelerated processing for real-time media transcoding, AI, and image manipulation demands optimized data exchange within compute nodes and across distributed systems. This shift introduces new challenges in memory sharing and communication. Using inter-node Remote Direct Memory Access (RDMA), this study presents a comparative evaluation of memory sharing mechanisms across three distinct GPU transfer paths used in High-Performance Computing (HPC) and media processing: inter-node GPU-GPU, intra-node host-GPU, and intra-node GPU-GPU memory exchange. The performance of each path is assessed under different software configurations: native memory operations without the aid of any communication framework (referred to as native), and higher-level abstractions such as Unified Communication X (UCX) and Libfabric. Key performance metrics, including Peripheral Component Interconnect Express (PCIe) bandwidth, latency, and Central Processing Unit (CPU) utilization, are measured. These results are intended to support the development of applications that require efficient and cost-effective memory sharing between computational devices. Additionally, this study highlights how higher-level communication frameworks provide flexible abstraction.

Print ISSN: 1545-0279
Electronic ISSN: 2160-2492
Published: 2026-05
Content type: Original Research
Keywords: gpu, libfabric, ucx, cots, mxl, rdma, cuda
DOI: 10.5594/JMI.2026/TEYI3961