Incremental Live Video Processing in IP-Based GPU-Leveraging Software Workflows
Live media production is in transition from fixed-function appliances toward general-purpose commercial-off-the-shelf (COTS) hardware in order to take advantage of the economies of scale and flexibility that software-driven media workflows provide. This flexibility shifts much of the responsibility for meeting a media flow's throughput and latency requirements to the software stack. While it is important for a media function's software to have sufficient performance to meet these requirements, it is also critical to have optimal performance in order to maximize the multi-tenancy potential of media workflows across a computer system. That is, by making efficient use of the COTS system's CPUs, GPUs, memory, buses, and network interface cards (NICs), these resources become more available to other software media functions running on the system. This increases the amount of media processing per unit of hardware (or “density”), and therefore the amount of media processing per dollar, whether that cost is in equipping and maintaining a data center or in cloud service subscription fees. Recently it has been shown that for IP-based software media workflows running on COTS hardware that handle uncompressed or mezzanine video streams, the majority of the media processing can be offloaded to GPUs and high-bandwidth NICs. GPUs are well suited to image processing as well as video decoding and encoding; offloading these operations frees CPU resources and accelerates their execution compared to CPUs. Additionally, NICs with GPU Remote Direct Memory Access (RDMA) can transfer incoming video frames directly into GPU memory, substantially reducing bus traffic (typically over PCIe). Live media is inherently paced, so a software program must wait some amount of time before it can begin processing the incoming media.
Currently, IP-based video workflows that ingest uncompressed or mezzanine video streams through a GPU-RDMA-capable NIC wait for an entire video frame to arrive before doing any image processing and/or encoding on a GPU. We demonstrate a novel approach for incrementally processing a SMPTE ST 2110-20 stream while a frame is being received. By taking advantage of SMPTE ST 2110's reliable pacing and the slice specifications of HEVC and AV1, we demonstrate an uncompressed media ingest pipeline that encodes the incoming stream into either HEVC or AV1 as it arrives. Preliminary results show a substantial decrease in overall latency and an increase in the number of streams that can run on a single system.
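The core idea of slice-granular ingest can be sketched as follows. This is a minimal illustration, not the paper's implementation: the frame and slice heights, the `encode_slice` callback, and the assumption that a SMPTE ST 2110-20 sender delivers scan lines top-to-bottom are all stand-ins for the real pipeline, in which each slice would be handed to a GPU encoder session rather than a Python callable.

```python
# Hypothetical sketch: dispatch each encoder slice as soon as the last
# scan line belonging to it has arrived, instead of waiting for the
# whole frame. Dimensions below are illustrative assumptions.
FRAME_HEIGHT = 1080
SLICE_HEIGHT = 135  # 1080 / 8 -> eight encoder slices per frame (assumed)


def incremental_slice_dispatch(lines_received, encode_slice):
    """Call encode_slice(slice_idx) as soon as a slice's lines are complete.

    lines_received: 0-based scan-line numbers in arrival order.
    Returns the slice indices in the order they were dispatched.
    """
    dispatched = []
    next_boundary = SLICE_HEIGHT - 1  # last line of the next pending slice
    for line in lines_received:
        # A paced ST 2110-20 sender emits lines top-to-bottom, so the
        # highest line number seen marks how much of the frame is present.
        while line >= next_boundary:
            slice_idx = next_boundary // SLICE_HEIGHT
            encode_slice(slice_idx)  # e.g., launch GPU encode of this slice
            dispatched.append(slice_idx)
            next_boundary += SLICE_HEIGHT
    return dispatched
```

For example, feeding all 1080 lines in order, `incremental_slice_dispatch(range(FRAME_HEIGHT), ...)`, dispatches slices 0 through 7, with slice 0 starting to encode after only 135 lines rather than after the full frame, which is where the latency reduction comes from.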
- Published
- 2024-10-21
- Content type
- Original Research
- Keywords
- software defined workflows, encoding, gpu, nic, latency, stream density
- DOI
- 10.5594/MOO/3005
- ISBN
- 978-1-61482-965-2