Enhancing Content Creation Workflows through Automatic Speech Recognition Techniques

Randy Fayan, Zahra Montajabi, Rob Gonsalves

This research provides a comprehensive review of prevailing Automatic Speech Recognition (ASR) methods and their impact on media production. Central to this exploration is the transformer model, whose self-attention mechanism is well suited to tasks that demand comprehension of temporal relationships. We conduct an in-depth comparison of leading ASR models, including Massively Multilingual Speech (MMS) from Meta, Whisper from OpenAI, and Google's Universal Speech Model (USM). Their performance is gauged against commercial services from Microsoft Azure, Amazon Web Services, and Google Cloud Platform. The paper covers aspects of ASR systems such as voice activity detection, language identification, and multilanguage support, along with the metrics used to characterize the accuracy of these systems. The research also identifies important challenges, such as the dearth of training data for some languages and the intricacies of linguistic nuance. Further, we discuss the role of ASR in media production, ranging from generating time-based captions to revolutionizing editing methodologies. Through a detailed breakdown of the ASR process, from audio preprocessing to postprocessing, this study endeavors to merge academic insights with practical applications, empowering media producers to leverage the capabilities of contemporary ASR technologies.

Published: 2023-10
Content type: Original Research
Keywords: Artificial Intelligence, Machine Learning, Automatic Speech Recognition
DOI: 10.5594/M002027
ISBN: 978-1-61482-964-5