Meta: Voicebox AI reproduces the voices of your friends and loved ones


As AI chatbots and AI art generators grow in popularity by the hour, some of the biggest players in the industry are trying to keep up by releasing new tools of their own.

Meta has just presented Voicebox, an AI speech generator that the company claims outperforms all existing models.

Voicebox is powerful enough to generate voices as easily as ChatGPT can generate text and Bing or Dall-E 2 can create images. Although the system is not yet available, Meta has made demos available to anyone who wants to learn more about Voicebox.

Match the audio style of a sample

The system could be used in audio editing by content creators and editors, for example, as its voice generation results in natural-sounding audio clips. And it’s versatile enough to intelligently remove noise from voice clips, like barking dogs, and regenerate the voice without missing a beat.

One of Voicebox’s capabilities is that it can match the audio style of a sample and generate text-to-speech clips.

The new generative AI tool can solve tasks through in-context learning, so it can process text it has never seen before and correctly generate context and inflections, just as a person reading aloud would, using existing knowledge to learn and meet new challenges.

A binary classification model can distinguish real speech from speech generated by Voicebox

The ethical and legal implications of this revolutionary tool are significant. Anyone could generate audio clips from recordings of someone's voice without their permission and make that person appear to say whatever they want.

In the research paper, Meta claims that a binary classification model can distinguish between real speech and speech generated by Voicebox. However, because the system is not yet accessible to the public, no one has yet tested the model's performance.

Meta trained Voicebox on 60,000 hours of English audiobooks and 50,000 hours of multilingual audiobooks in six languages. According to Meta, this training allows it to perform multilingual speech synthesis without task-specific training, and to denoise speech, transfer its style, edit it, and generate varied speech samples.

In search of performance

In an article published by Meta AI, the company claims that Voicebox can generate varied audio samples 20 times faster than Microsoft's VALL-E, with more intelligible results.

Meta claims that, in addition to being faster and making fewer errors than competing models, Voicebox can convert written text into spoken words in one or more languages without being trained specifically for each language.

Compared with YourTTS, an earlier model, Voicebox reduced the average word error rate from 10.9% to 5.2% and increased audio similarity from 0.335 to 0.481.
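To put those figures in perspective, the short calculation below expresses each change relative to the YourTTS baseline. It is only an illustrative sketch in Python: the helper name `relative_change` is made up for this example, and the inputs are the four numbers quoted above.

```python
def relative_change(old: float, new: float) -> float:
    """Return the change from old to new as a fraction of old."""
    return (new - old) / old


# Figures quoted in the article (YourTTS baseline -> Voicebox).
wer_change = relative_change(10.9, 5.2)      # word error rate, in percent
sim_change = relative_change(0.335, 0.481)   # audio similarity score

print(f"Word error rate:  {wer_change:+.1%}")   # prints "Word error rate:  -52.3%"
print(f"Audio similarity: {sim_change:+.1%}")   # prints "Audio similarity: +43.6%"
```

In other words, on these reported numbers the word error rate is roughly halved, while the similarity score rises by a little over 40% relative to the baseline.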


Source: “ZDNet.com”


