The Future of Artificial Intelligence Voice Cloning

Voice cloning with artificial intelligence is at once tedious and surprisingly simple. It is an emerging technology that can replace the robotic sound of virtual assistants with natural-sounding human voices. Interestingly, voice cloning can also produce distinctive, human-sounding voice agents that make media content more engaging. So if you are a podcaster, a filmmaker, or a game developer, this article is for you.

How Does Artificial Intelligence-Powered Voice Cloning Work?

Voice cloning software is the speech counterpart of the video deepfake. From a short speech recording, developers can build an audio dataset and train an AI voice model that can read any text in the target voice. First, the speaker talks into a microphone for 30 minutes or so, reading a script as clearly as they can. The resulting audio file is then fed into the cloning software, which trains the voice model. From then on, anything typed into a text box is spoken back in the cloned voice. The result is realistic enough to deceive even friends and family, at least for a few moments. It is worth noting that this technology was not created to commit fraud or deceive people.
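To make that workflow concrete, below is a minimal sketch of what the process typically looks like from a developer's side, written in Python against a hypothetical cloud voice-cloning API. The base URL, endpoint paths, field names, and API key are illustrative assumptions, not any particular vendor's actual interface.

```python
# Minimal sketch of a typical voice-cloning workflow, assuming a
# hypothetical REST API -- endpoints and field names are illustrative,
# not any specific vendor's real interface.
import requests

API_KEY = "your-api-key"                        # hypothetical credential
BASE_URL = "https://api.example-voice.com/v1"   # hypothetical endpoint
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

# Step 1: upload ~30 minutes of clean, scripted speech from the target speaker.
with open("speaker_recording.wav", "rb") as audio:
    resp = requests.post(
        f"{BASE_URL}/voices",
        headers=HEADERS,
        files={"audio": audio},
        data={"name": "my-voice-clone"},
    )
voice_id = resp.json()["voice_id"]  # identifier of the trained voice model

# Step 2: type any text and have it spoken back in the cloned voice.
resp = requests.post(
    f"{BASE_URL}/voices/{voice_id}/synthesize",
    headers=HEADERS,
    json={"text": "Hello! This sentence was never actually recorded."},
)

# Step 3: save the generated speech to disk.
with open("cloned_output.wav", "wb") as out:
    out.write(resp.content)
```

In practice, each vendor wraps these steps in its own SDK or web interface, but the shape of the workflow (record, train, then type-to-speak) is broadly the same.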

Voice cloning technology has improved rapidly in recent years thanks to innovations in machine learning. Previously, synthetic voices were created by recording audio of voice talent, slicing their speech into component sounds, and stitching those sounds back together, producing a ransom-note effect. Now, neural networks can be trained on recordings of the target voice to generate raw audio of that person speaking from scratch. The results are faster to produce, more dynamic, and more realistic than the older method. The quality is not yet perfect straight out of the machine, but with manual editing, and as the models keep improving, the output will only get better.
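For contrast, here is a toy illustration of the older concatenative approach described above, assuming nothing beyond NumPy and Python's standard library. Sine tones stand in for pre-recorded speech units; the point is only to show the "slice and stitch" idea that produced the ransom-note effect, whereas a neural model instead generates the raw waveform directly, conditioned on the input text.

```python
# Toy illustration of the old concatenative approach: pre-recorded "units"
# are pulled from a voice bank and glued end to end. Sine tones stand in
# for real recorded speech fragments; this is a sketch of the idea, not TTS.
import numpy as np
import wave

SAMPLE_RATE = 16000

def make_unit(freq_hz, duration_s=0.15):
    """Stand-in for a recorded speech unit (e.g. a diphone)."""
    t = np.linspace(0, duration_s, int(SAMPLE_RATE * duration_s), endpoint=False)
    return 0.3 * np.sin(2 * np.pi * freq_hz * t)

# A tiny "unit bank": in a real system these would be snippets cut
# from hours of studio recordings of the voice talent.
unit_bank = {"HH": make_unit(220), "EH": make_unit(330),
             "L": make_unit(262), "OW": make_unit(392)}

def concatenative_synthesize(units):
    """Glue units together back to back -- the 'ransom note' effect."""
    return np.concatenate([unit_bank[u] for u in units])

audio = concatenative_synthesize(["HH", "EH", "L", "OW"])  # "hello"

# Write the result to a WAV file so it can be played back.
with wave.open("concat_demo.wav", "wb") as f:
    f.setnchannels(1)
    f.setsampwidth(2)
    f.setframerate(SAMPLE_RATE)
    f.writeframes((audio * 32767).astype(np.int16).tobytes())
```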

Several start-ups offer this particular service. Just ask Google for help: type “AI voice synthesis” or “AI voice deepfakes” into the search bar and, voilà, you will see how widespread the technology has become. Companies that focus on speech synthesis include Resemble.AI, Respeecher, Veritone, and Descript.

Resemble.AI 

Based in Canada, Resemble.AI develops custom voices using proprietary deep learning models that produce realistic speech synthesis. According to Crunchbase, the firm has raised $4 million in funding over three seed rounds, backed by eight investors including Spacecadet Ventures and Betaworks.

Respeecher

Friends and colleagues Alex Serdiuk, Dmytro Bielievtsov, and Grant Reaber established Respeecher in February 2018. The company uses artificial intelligence for speech synthesis, helping filmmakers, TV producers, game developers, advertisers, podcasters, and other content creators make innovative content.

In March 2020, Ukraine-based Respeecher received a total of $1.5 million in start-up funding from ff Venture Capital, Acrobator Ventures, ICU Ventures, Network VC, and several angel investors.

Veritone

Veritone, Inc. was established by Chad and Ryan Steelberg in 2014 to help organizations harness the power of artificial intelligence. Headquartered in Denver, Colorado, USA, Veritone is a leader in AI technology and solutions. The company’s proprietary operating system, aiWARE, applies machine learning models to transform audio, video, and other data sets into actionable information that gives companies a competitive edge.

Descript

Descript uses artificial intelligence to build a platform that makes editing audio and video as fast and easy as editing a Google Doc. Its software is popular for podcasting, video editing, screen recording, and transcription.

The Controversial Audio Deepfake in the Anthony Bourdain Documentary

A remarkably realistic AI-generated voice clone of MMA commentator turned podcaster Joe Rogan surfaced in May 2019. Then, in July 2021, viewers heard a few lines of dialogue in Anthony Bourdain’s voice that he may never have actually said. The lines appear in Roadrunner, a documentary about the life and tragic death of the American celebrity chef. The controversy erupted when the film’s creators revealed that they had used artificial intelligence to re-create the late chef’s voice.

The filmmakers used the software to synthesize three quotes in Anthony Bourdain’s voice. The deepfaked voice came to light when The New Yorker correspondent Helen Rosner asked how award-winning filmmaker Morgan Neville had obtained a recording of Bourdain reading an email he had sent to one of his friends.

Then, in August 2021, the start-up Sonantic announced that it had created an AI voice clone of actor Val Kilmer, whose voice was damaged in 2014 after he underwent surgery as part of his treatment for throat cancer.

These examples also illustrate some of the social and ethical dimensions of this technology. Many consider the Bourdain case exploitative, whereas the Kilmer case has been widely praised as a practical, humane use of the technology.

In the next few years, influencers and other celebrities could rent their voices to companies, all thanks to artificial intelligence

Voice clones of celebrities are likely to be among the most striking applications of the technology in the coming years. Companies and organizations hoping to raise their profile and their profits might tap the stars by licensing cloned versions of their voices. Veritone, for example, launched such a service earlier this year, announcing that it would let influencers, actors, and athletes license their AI voices for advertisements without ever stepping into a studio or onto a film set.

This kind of application is not yet widespread, but it could become another way for celebrities to make money. Hollywood actor Bruce Willis, for instance, has already licensed his image as a visual deepfake for a mobile phone advertising campaign in Russia, a project that lets him earn money without leaving his house. It is a win-win: the advertising agency gets a famous actor, and even a much younger version of Willis, for its commercials. Audio and visual clones thus open a new revenue stream for celebrities, enabling them to capitalize on their popularity.

What About Others?

But what does this technology mean for ordinary people, those of us who are not famous celebrities or influencers? It depends on how the technology is applied. It is easy to imagine a video game whose character-creation screen offers an option to clone your own voice; for gaming fans, that means more interactive and engaging play. Elsewhere, an application might let parents clone their voices so they can read bedtime stories to their young children even when they are not at home. With today's rapidly evolving technology, such solutions are already within reach.

However, there are also downsides. Fraudsters could exploit voice clones to trick companies and individuals into transferring money into their accounts. Among teenagers, there is the danger of recording a peer’s voice, cloning it, and using it to badmouth a teacher or another student. Moreover, using the technology for nonconsensual pornography poses one of the biggest threats.

Above all, anyone could find their voice turned into an AI clone. The formulation of guidelines and clear-cut policies will therefore help minimize the dangers of this emerging technology.
