Exploring Automated Voice Casting for Content Localization Using Deep Learning
Aansh Malik, Ha Nguyen
Casting voice actors to dub source-language content into a target language, known as voice casting, is largely a manual workflow that could benefit substantially from automation. Recent advances in deep learning architectures for sequential data processing are enabling a range of AI-assisted audio-processing workflows. In particular, applications such as speaker verification and speech synthesis have gained considerable traction with the advent and maturation of recurrent neural networks. We explore the viability of applying deep learning advances in text-independent speaker verification (TI-SV) to computer-aided voice casting. To this end, we propose and develop an automated voice-casting tool that ranks voiceover artists across languages using similarity scores computed from neural network embeddings, which are produced by a robust autoencoder model trained for TI-SV. To evaluate the efficacy of the proposed approach, we conduct a subjective study that emulates a simplified voice-casting process on actual voice-testing kits (dubbing auditions) from our content. We also use decisions from casting experts to further evaluate the tool and to assess the subjectivity involved in the voice-casting process. The automated tool achieves promising results, which suggest that it is a viable approach to automating the voice-casting process and warrants further exploration.
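The ranking step described above can be illustrated with a minimal sketch. It assumes that the similarity score is the cosine similarity between a source-speaker embedding and each candidate's embedding; the abstract does not specify the metric, so this choice is an assumption. The function names, artist labels, and three-dimensional toy vectors are hypothetical placeholders, not outputs of the TI-SV model.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def rank_candidates(
    source: Sequence[float],
    candidates: Dict[str, Sequence[float]],
) -> List[Tuple[str, float]]:
    """Return (name, score) pairs sorted from most to least similar."""
    scores = [
        (name, cosine_similarity(source, embedding))
        for name, embedding in candidates.items()
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy embeddings; real speaker embeddings would have far
# more dimensions and come from the trained TI-SV model.
source_embedding = [0.9, 0.1, 0.3]
candidate_embeddings = {
    "artist_a": [0.8, 0.2, 0.4],
    "artist_b": [-0.5, 0.9, 0.1],
    "artist_c": [0.1, 0.1, 0.9],
}

ranking = rank_candidates(source_embedding, candidate_embeddings)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In this sketch, the candidate whose embedding points in nearly the same direction as the source embedding is ranked first, which mirrors the idea of shortlisting target-language voiceover artists who sound most like the original speaker.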