
Do you remember the first time you experienced voice synthesis? For many of us, it was watching The Andy Warhol Diaries on Netflix last year.

The six-part series gave a uniquely personal glimpse into the mind of one of the 20th century's most innovative artists by narrating recreated scenes and archival footage with a voiceover of Warhol’s personal diaries. Perhaps most engaging is that it’s Warhol’s voice doing the talking.

And no, he didn’t record his diaries before his death in 1987. While he did intend for his diaries to be published posthumously and often dictated them over the phone to his friend Pat Hackett to transcribe, they weren’t actually recorded at the time.

Instead, series director Andrew Rossi obtained explicit permission from the Andy Warhol Foundation to create an artificial intelligence-generated version of Warhol’s voice using a technique known as speech-to-speech voice synthesis. This is a process in which two voices are blended into one: an AI model trained on archival recordings of Warhol speaking, and a voice actor mimicking Warhol’s vocal cadence and native Pittsburgh accent.

As artificial intelligence and machine learning tools develop, so too do the realism and efficiency of creating synthetic media.

This is a brilliant case study of AI-based voice synthesis at its best. It elevates the series by drawing us into the story and lending the telling a certain vulnerability. After all, it’s Andy reading his own diaries, so as viewers we can rest a little easier knowing the words coming out of his mouth truly reflect his personal point of view.

And it’s a project that couldn’t have been done even three years ago.

Netflix – The Andy Warhol Diaries (From Executive Producer Ryan Murphy) | Official Trailer

Above: The trailer for Netflix's Andy Warhol Diaries, narrated by an AI version of the artist.


From stuttering toddler to vocal acrobat

As artificial intelligence and machine learning tools develop, so too do the realism and efficiency of creating synthetic media. Many of us are already familiar with deepfakes, in which human faces are recreated with a slightly uncanny yet still believable realism (anyone else see This is not Morgan Freeman on YouTube?). Or, more recently, OpenAI’s DALL-E, which allows you to generate stunning artwork from text prompts. The results are so lifelike that it’s hard to fully grasp that this media was created by AI.

Synthetic voices are also having their own, albeit slightly quieter, revolution.

I first got into voice synthesis technology a few years ago when Ambassadors Lab, the studio for innovation and creative technology within creative production studio Ambassadors, started exploring the benefits of machine learning for creative production and advertising. AI voices had piqued my interest.

With technology advancing rapidly, we and the industry at large are now able to create voices that sound unbelievably real.

We wondered whether AI could be used to produce realistic voices. Could it be used to speak multiple languages? Could it be moulded into specific characteristics, such as a deep male voice or a bright female one? Could it ever be so lifelike that it could evoke emotion?

Until recently, the answer would have been a maybe. However, with technology advancing rapidly, we and the industry at large are now able to create voices that sound unbelievably real. We can give the AI text as input (text-to-speech). We can let AI change the voice’s emotion. We can even get it to repeat what we say with the exact speed and intonation, transformed into a different voice (speech-to-speech).
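The distinction between those two input modes can be sketched in code. This is a purely illustrative, hypothetical example: the classes below don’t correspond to any real library, they only show how text-to-speech invents its own delivery from plain text, while speech-to-speech preserves the exact speed and intonation of a source recording.

```python
from dataclasses import dataclass

@dataclass
class Audio:
    samples: list          # raw waveform (placeholder)
    prosody: str = "flat"  # the speed and intonation of the delivery

class VoiceModel:
    """Stand-in for a model trained on a target speaker's recordings."""
    def __init__(self, speaker: str):
        self.speaker = speaker

    def text_to_speech(self, text: str) -> Audio:
        # TTS: the model generates its own prosody from plain text.
        return Audio(samples=[0.0] * len(text), prosody="model-generated")

    def speech_to_speech(self, source: Audio) -> Audio:
        # STS: keep the actor's exact speed and intonation,
        # but render it in the target speaker's timbre.
        return Audio(samples=source.samples, prosody=source.prosody)

warhol = VoiceModel("Warhol")
tts_out = warhol.text_to_speech("Gee, the diary is really something.")
actor_take = Audio(samples=[0.1, 0.2, 0.3], prosody="actor cadence")
sts_out = warhol.speech_to_speech(actor_take)
```

In a real speech-to-speech pipeline, this is why a voice actor was still needed for the Warhol series: the human performance supplies the cadence, and the model supplies the voice.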

Today, the quality of synthesized voices is surpassing that of Siri, Alexa, or the AI-generated voices you hear on TikTok. We now have the tools to produce convincing results with unprecedented speed and creative precision. And that’s necessary if synthesized voices are to be taken seriously in the creative industry.

Above: A handful of OpenAI’s DALL-E creations: a Shiba Inu dog wearing a beret and black turtleneck; an oil pastel drawing of an annoyed cat in a spaceship; and a van Gogh-style painting of an American football player.


Letting creativity speak louder

The way I see it, AI-based voice synthesis is an exciting creative tool, just as VR broke a barrier in creating empathy and experience in digital spaces, and just as advancements in VFX let us create highly believable CG animation.

And of course, like many creative tools, voice synthesis has found a foothold in Hollywood. Not just in recreating the voices of those long gone, like Warhol, but also in creating stunning replicas of living actors. For Darth Vader actor James Earl Jones, it meant he could finally retire at 91 while letting the character live on in many more Star Wars productions to come. For Val Kilmer, who lost his natural speaking voice to throat cancer, it meant getting his voice back to appear in Top Gun: Maverick.

There’s a lot of potential for how voice synthesis can be applied in the advertising industry too. So far, AI voice has typically been combined with AI faces to create deepfakes that drive the campaign idea. For example, the 2021 Cannes Lions for Good Grand Prix went to a campaign in which murdered journalist Javier Valdez Cardenas was brought back to life in videos to directly address the Mexican president, demanding justice for himself and his colleagues, and demanding free speech for everyone.

So far, AI voice has typically been combined with AI faces to create deepfakes that drive the campaign idea.

We should keep in mind that voice synthesis doesn't need to be the star of the show like Warhol’s voice is in the docu-series. Instead, voice synthesis can be a backup act working tirelessly in the background to simply make our marketing workflows faster, easier, and smarter, and to produce better results.

For example, template-based creative automation solutions are already enabling brands to generate hundreds of video ad versions for various markets and channels. But now, with the added ability to include voice-overs with multiple speakers in multiple languages, marketers can make their creative more impactful, helping visual content speak louder – quite literally.
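The versioning workflow described above can be sketched as a simple market-by-speaker loop. Everything here is a hypothetical stand-in: `synth_voice()` represents whatever TTS service a production pipeline would actually call, and the localized scripts and voice names are invented for illustration.

```python
# Localized script templates, one per language (illustrative examples).
LOCALIZED = {
    "en-US": "{product}: now available in {market}.",
    "nl-NL": "{product}: nu verkrijgbaar in {market}.",
    "de-DE": "{product}: jetzt erhaeltlich in {market}.",
}

def synth_voice(text: str, locale: str, speaker: str) -> dict:
    # Placeholder for a real TTS call; returns metadata instead of audio.
    return {"locale": locale, "speaker": speaker, "script": text}

def render_versions(product: str, markets: dict, speakers: list) -> list:
    """Generate one voice-over per market x speaker combination."""
    versions = []
    for market, locale in markets.items():
        script = LOCALIZED[locale].format(product=product, market=market)
        for speaker in speakers:
            versions.append(synth_voice(script, locale, speaker))
    return versions

versions = render_versions(
    "SuperSneaker",
    {"New York": "en-US", "Amsterdam": "nl-NL", "Berlin": "de-DE"},
    ["voice_a", "voice_b"],
)
# 3 markets x 2 speakers -> 6 voice-over variants
```

The point is the combinatorics: every new market or speaker multiplies the output, which is exactly where manual voice-over recording stops scaling and synthesis starts paying off.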

This increased flexibility also encourages more agile creativity, providing more space for experimentation and iteration. For brands always striving to optimize their ads for the best performance, this can be a game-changer.

Still Speaking Up

Above: Still Speaking Up, 2021 Cannes Lions for Good Grand Prix winner.


So what now?

At the rate that voice synthesis technology is developing worldwide, these options will be available to creatives in an even more convincing and cost-effective way in the very near future.

We need to be vigilant of the ethical aspects around [voice synthesis] so that it won’t be misused. 

The creative potential for voice synthesis is endless. And the technology for this is progressing at a breakneck pace. At the same time, we need to be vigilant of the ethical aspects around this technology so that it won’t be misused. 

Once we take this into account, we can fully embrace what AI-generated voices can do for advertising: super-efficient, super-realistic storytelling.
