Automatic Speech Recognition with Machine Learning: Techniques and Evaluation of Current Tools

Randy Fayan, Zahra Montajabi, Rob Gonsalves

This research offers an in-depth review of current Automatic Speech Recognition (ASR) methods and their significant impact on media production, with a focus on the transformer model's self-attention mechanism for capturing sequential relationships. It compares the accuracy and performance of leading ASR models, such as Meta's Massively Multilingual Speech, OpenAI's Whisper, and Google's Universal Speech Model, along with services from Microsoft Azure, Amazon Web Services, and Google Cloud Platform. The study examines key aspects of ASR, including voice activity detection, language identification, and multilingual support, and evaluates the accuracy of each. It highlights challenges such as limited training data for certain languages and the complexity of linguistic nuances. The paper also discusses ASR's role in media production, from creating time-based captions to transforming editing techniques. By analyzing the ASR process from audio preprocessing to post-processing, the research bridges academic and practical perspectives, enabling media producers to use advanced ASR technologies effectively.

Print ISSN: 1545-0279
Electronic ISSN: 2160-2492
Published: 2024-04
Content type: Original Research
Keywords: artificial intelligence, machine learning, automatic speech recognition
DOI: 10.5594/JMI.2024/IPYX8877