RealTalk: We Replicated A Real Person’s Voice With AI

Dessa

Via Medium

…and it’s the voice of Joe Rogan. Disclaimer: Rogan didn’t actually endorse our work like this. Similarly, RealTalk is not an endorsement of Rogan’s podcast or opinions.

Today we’re excited to announce that three Machine Learning Engineers at Dessa (Hashiam Kadhim, Rayhane Mama, and Joseph Palermo) have built the most lifelike AI simulation of a voice we’ve heard to date.

It’s the voice of someone you’ve probably heard of before: Joe Rogan. (For those who haven’t, Joe Rogan is the creator and host of one of the world’s most popular podcasts, which has nearly 1,300 episodes and counting.)

Obviously, something like this has to be heard to be believed. So without further ado, here it is:

100% of the following audio was generated by the machine learning model using only text input. This includes the breaths, the ‘um’s and ‘ah’s, and all other noises.

The team produced their facsimile of Rogan’s voice using a text-to-speech deep learning system they built called RealTalk, which generates life-like speech using only text as its input.

Bananas, right?
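For readers curious about what “text in, speech out” means mechanically: RealTalk’s actual architecture is covered in Pt. II of this post, but most neural text-to-speech systems follow the same broad pipeline of text → phonemes → spectrogram → waveform. The sketch below is a toy, conceptual illustration of that pipeline only; every function name, shape, and number here is an illustrative stand-in with dummy math, not Dessa’s model.

```python
import numpy as np

# Toy sketch of a generic neural TTS pipeline (conceptual only).
# All stages below are dummy stand-ins, not real trained models.

def text_to_phonemes(text):
    """Stand-in grapheme-to-phoneme step: map characters to integer IDs."""
    return np.array([ord(c) % 64 for c in text.lower()])

def acoustic_model(phoneme_ids, n_mels=80, frames_per_symbol=5):
    """Stand-in seq2seq acoustic model: emit a mel-spectrogram-shaped array."""
    n_frames = len(phoneme_ids) * frames_per_symbol
    rng = np.random.default_rng(0)
    return rng.standard_normal((n_frames, n_mels))

def vocoder(mel, hop_length=256):
    """Stand-in neural vocoder: expand each spectrogram frame into audio samples."""
    n_samples = mel.shape[0] * hop_length
    rng = np.random.default_rng(1)
    return rng.uniform(-1.0, 1.0, n_samples).astype(np.float32)

wave = vocoder(acoustic_model(text_to_phonemes("hello there")))
print(wave.shape)  # prints (14080,)
```

The point of the sketch is the division of labor: one learned model maps text to an intermediate acoustic representation (typically a mel spectrogram), and a second learned model (the vocoder) turns that representation into a raw waveform.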

If you’re like us, and specifically like our Principal ML Architect, Alex Krizhevsky, you might be thinking that it’s “one of the more impressive things I’ve seen yet in artificial intelligence.” Alex also noted that the work suggests that “Human-like speech synthesis may soon be a reality everywhere.”

What Does This Mean? Considering Societal Impact

It’s pretty surreal for our engineers to say they’ve created a lifelike replica of Joe Rogan’s voice using AI. Not to mention that the model could produce a replica of anyone’s voice, given sufficient data.

As practitioners focused on AI’s real-world impact, we’re especially aware that we need to be talking about what this means for society.

Because clearly, the implications of synthetic media technologies like speech synthesis are massive. On top of that, their potential outcomes could affect everyone: consumers poor and rich, enterprises and governments alike.

Right now, technical expertise, ingenuity, computing power and data are required to make models like RealTalk perform well. So not just anyone can go out and do it.

But in the next few years (or sooner), we could see the technology advance to the point where only a few seconds of audio are needed to create a life-like replica of anyone’s voice on the planet.

It’s pretty f*cking scary.

Here are some examples of how the technology could be used nefariously:

  • Spam callers impersonating a family member to obtain personal information
  • Impersonation for the purposes of bullying or harassment
  • Gaining entrance to high security clearance areas by impersonating a government official
  • An ‘audio deepfake’ of a politician being used to manipulate election results or cause a social uprising

Obviously, though, not everything is doom and gloom. There are also really positive things that could come from realistic speech synthesis:

  • Talking to a voice assistant in a way that’s as natural as talking to a friend
  • Personalized voice applications — for instance, a workout app that contains a personalized coaching session from a celebrity
  • Improved accessibility options for people who communicate through text-to-speech devices, for example people with Lou Gehrig’s disease (ALS)
  • Automating voice dubbing for any media and in any language

As the recent report “The Malicious Use of Artificial Intelligence” by Oxford’s Future of Humanity Institute notes, new advancements in artificial intelligence not only expand existing threats, but also create new ones.

We won’t pretend to have all the answers about how to build this technology ethically. At the same time, we think it will inevitably be built and increasingly implemented in our world over the coming years.

So in addition to raising awareness and acknowledging these issues, we want to share this work as a way of starting a conversation about speech synthesis that needs to be had.

Everyone should know what kinds of things are possible with the development of speech synthesis technologies. As we’ve seen with visual deepfakes, public awareness and dialogue also push governments, policymakers and lawmakers to take action and develop countermeasures swiftly.

To work on things like this responsibly, we think the public should be made aware of speech synthesis models’ implications before we release anything open source. For this reason, we will not be releasing our research, model, or datasets publicly at this time.

Next steps

We encourage everyone reading this to remember that speech synthesis is getting better and better every day. On the horizon, it’s not outlandish to think that the implications we mentioned (and many more) will make their way into society.

So pay attention! Knowledge is power, and we encourage individuals, companies, and governments to join the conversation on how we can responsibly implement these technologies.

Curious about how RealTalk was built? Check out Pt. II of the blog post here for a technical overview of the text-to-speech synthesis model, data, and more.

We also encourage you to check out a Turing Test-style game the RealTalk team built to showcase the naturalness and intelligibility of this model, which can be found at www.fakejoerogan.com.

Please note that this project does not imply that we endorse the views and opinions of Joe Rogan. His voice was selected purely to demonstrate the capability of this technology.