Seamless Audio Splicing for ISO/IEC 13818 Transport Streams
The work reported in this paper addresses the processing of packetized audio elementary streams (ESs) during the splicing of ISO/IEC 13818 (MPEG-2) transport streams (TS). An algorithm is developed which, without decoding and re-encoding the audio ESs, produces a continuous audio ES across the splicing point, avoids all noticeable audio artifacts, and hence provides a seamless audio splice. Perfect continuity with respect to the presentation time stamps (PTS), together with careful management of the audio buffer level at all times, leads to glitch-free audio play-out for an unlimited number of splices. The proposed algorithm parses the packetized audio ES down to the audio access unit (AAU) level and is computationally very simple.

In the MPEG-2 syntax, the coded data of the two fundamental signal components, video and audio, are encapsulated in so-called access units (AUs). An AU is the smallest self-contained segment of compressed data that can be decoded and presented. The coded representation of one video frame and that of a fixed number of audio PCM samples (depending on the audio encoding algorithm layer) constitute the AUs for video and audio, respectively. The play-out (presentation) duration of a video access unit (VAU) is a function of the input video source format (NTSC or PAL), whereas the play-out duration of an AAU depends on the audio encoding algorithm layer employed and the sampling frequency of the audio signal. It turns out that, within the allowable ranges of these parameters, the decoded forms of video and audio AUs, known as presentation units (PUs), practically never end their play-out intervals at the same instant. This misalignment creates the problem of an unequal amount (in total play-out duration) of video and audio data when a program is terminated at a splicing point.
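The misalignment described above can be illustrated numerically. The sketch below assumes one common parameter set not fixed by the paper: MPEG-1 Layer II audio (1152 PCM samples per AAU) sampled at 48 kHz, spliced against NTSC video (30000/1001 frames per second); all names are hypothetical.

```python
from fractions import Fraction

# Illustrative parameters (assumptions, not mandated by the paper):
SAMPLES_PER_AAU = 1152                 # MPEG-1 Layer II frame size
AUDIO_FS = 48_000                      # Hz
APU_DURATION = Fraction(SAMPLES_PER_AAU, AUDIO_FS)  # exactly 24 ms

# NTSC video frame period: 1001/30000 s (~33.367 ms).
VPU_DURATION = Fraction(1001, 30000)

def boundary_misalignments(n_frames):
    """For each of the first n_frames video PU endings, return the
    distance (in ms) to the nearest APU boundary on the audio grid."""
    skews = []
    for k in range(1, n_frames + 1):
        t = k * VPU_DURATION                    # end of k-th VPU
        n = round(t / APU_DURATION)             # nearest APU boundary index
        skews.append(abs(float(t - n * APU_DURATION)) * 1000)
    return skews

skews = boundary_misalignments(30)
```

With these rates the two boundary grids only coincide every 720 video frames, so every skew in a one-second window is strictly positive, and by construction each is bounded by half an APU duration (12 ms).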
Until now, the packetized audio elementary streams have been handled either by leaving an audio gap at the splicing point, which produces an undesirable audio mute, or by decoding and re-encoding the audio signal, which is computationally very costly.

We propose a solution to this problem by introducing the so-called "best aligned audio presentation units (APUs)". The best aligned APU of an ending program is the APU whose presentation interval ends closest to the ending instant of the presentation interval of the last video presentation unit (VPU) to be displayed from that program. For a starting program, the best aligned APU is defined similarly, but with respect to the starting instants of the relevant PUs. Based on the best aligned APUs of the ending and starting programs, the audio PTSs in the starting program are re-stamped to achieve a continuous audio stream across the splicing point. It is easily shown that, through the use of the best aligned APUs, the audio-visual skew introduced in the starting program is upper bounded by half the play-out duration of an APU, which is far below the sensitivity of the human observer (typically a fraction of the video frame interval). Cases where the best aligned APUs can be used directly are known as minimal achievable skew cases. The re-stamping inevitably shifts the mean audio buffer level of the starting program, and this shift may make the direct use of the best aligned APUs impossible; this occurs when the original audio buffer dynamics of the starting clip are not safely bounded away from underflow or overflow. Such marginal cases are carefully managed by adding one APU to, or deleting one APU from, the set delimited by the best aligned APUs.
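The best-aligned-APU selection and PTS re-stamping steps can be sketched as follows, in 90 kHz PTS ticks. This is a minimal illustration under assumed inputs (flat lists of per-APU PTS values, a fixed APU duration); the function names and stream layout are hypothetical, not the paper's interface.

```python
def best_aligned_apu_end(vpu_end_pts, first_apu_end_pts, apu_duration_pts):
    """Index (0-based) of the APU whose presentation end is closest to
    the last VPU's presentation end; the residual skew is at most half
    an APU duration by the nearest-multiple rounding."""
    n = round((vpu_end_pts - first_apu_end_pts) / apu_duration_pts)
    return max(n, 0)

def restamp_audio(starting_apu_pts, splice_audio_pts):
    """Shift the starting program's audio PTSs by a constant offset so
    its first retained APU begins exactly where the ending program's
    audio stopped, giving a continuous audio ES across the splice."""
    offset = splice_audio_pts - starting_apu_pts[0]
    return [pts + offset for pts in starting_apu_pts]
```

For example, with 24 ms APUs (2160 ticks) and an ending program whose last VPU ends at tick 100,000, the selected APU end lands within 1080 ticks of the video boundary, and the re-stamped starting stream keeps its original inter-APU spacing.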
Consequently, the inevitable change in the starting program's mean audio buffer level is introduced in a controlled fashion and in the proper direction, avoiding underflows and overflows. Cases where a deviation from the best aligned APUs is necessary are known as minimal achievable safe skew cases. It can again be easily shown that the minimal achievable safe skew is upper bounded by the full play-out duration of an APU, which is still well below noticeable limits. A very simple model of the packetized audio ES is also developed, which facilitates the classification of the starting program's audio buffer dynamics described above.

The principles of the proposed algorithm are very general and can easily be extended to cover other audio encoding algorithms with AU-based data encapsulation, as well as other forms of data encapsulated in AUs.
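A toy classifier in the spirit of that buffer model might look as follows. Everything here is an assumption for illustration: the buffer trajectory is given as a list of fullness samples, the re-stamping shift is a single constant `delta`, and the direction in which one APU is added or deleted is only indicative of the controlled correction the paper describes.

```python
def classify_splice(buffer_levels, delta, b_min, b_max):
    """Classify a splice from the starting clip's audio buffer trajectory.

    buffer_levels: sampled buffer fullness values of the original clip
                   (hypothetical units).
    delta:         constant level shift induced by PTS re-stamping.
    b_min, b_max:  safe underflow/overflow bounds of the audio buffer.
    """
    shifted = [b + delta for b in buffer_levels]
    if all(b_min <= b <= b_max for b in shifted):
        # Best aligned APUs usable directly: skew <= half an APU.
        return "minimal achievable skew"
    if min(shifted) < b_min:
        # Steer the mean level upward, e.g. by keeping one extra APU;
        # skew grows but stays <= one full APU duration.
        return "add one APU"
    # Symmetric overflow-side correction.
    return "delete one APU"
```

The two deviation branches correspond to the minimal achievable safe skew cases: the correction is exactly one APU, so the resulting skew is bounded by one APU play-out duration.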
- Published: 2000-10
- Content type: Original Research
- DOI: 10.5594/M00170
- ISBN: 978-1-61482-933-1