VALL-E: With three seconds of audio, this new AI can simulate anyone’s voice


After DALL-E, here is VALL-E. Microsoft has unveiled its artificial intelligence-based text-to-speech model, which can reproduce a person's voice from only three seconds of audio recording. A new revolution in the field of AI, covered in particular by our colleagues at Ars Technica.

A corpus of 60,000 hours of speech

VALL-E thus makes it possible to reproduce almost perfectly the timbre of an average person's voice, while preserving the speaker's tone and emotion. The AI can then be made to recite any text: a kind of deepfake for audio, in short. The tool developed by Microsoft researchers can also be combined with the GPT-3 text model to generate speech on its own.

To develop this “neural codec language model”, the developers relied on EnCodec, an AI-based audio codec created by Meta and unveiled in October 2022. To learn to reproduce the tones of a voice, VALL-E was trained on the LibriLight audio library: 60,000 hours of English speech from more than 7,000 different speakers fed the Microsoft model's knowledge base.
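To give an idea of what this codec step produces, Meta's open-source EnCodec package can turn a short recording into the discrete acoustic tokens that a neural codec language model such as VALL-E then predicts from text. A minimal sketch, assuming the package is installed; the file name and bandwidth setting are illustrative, and this is not Microsoft's actual pipeline:

```python
# pip install encodec torchaudio  (Meta's open-source EnCodec package)
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# Load the pretrained 24 kHz codec and choose a target bandwidth.
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)  # kbps

# "prompt.wav" is a placeholder for a ~3-second voice sample.
wav, sr = torchaudio.load("prompt.wav")
wav = convert_audio(wav, sr, model.sample_rate, model.channels)
wav = wav.unsqueeze(0)  # add batch dimension

# Encode the waveform into discrete codec tokens.
with torch.no_grad():
    encoded_frames = model.encode(wav)
codes = torch.cat([codebook for codebook, _ in encoded_frames], dim=-1)
print(codes.shape)  # (batch, num_codebooks, time_steps)
```

It is sequences of tokens like these, rather than raw waveforms, that the model learns to continue in the voice of the three-second prompt.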

Atypical accents, a weakness of VALL-E

One limitation: VALL-E can sometimes mispronounce, skip or duplicate certain words, a flaw explained by the very nature of the model used (autoregressive) and one that should be corrected in future versions. The AI also has a hard time with strong accents. Although the LibriLight audio library is diverse, it is not enough to cover all of the accents found around the globe. To correct this bias, VALL-E will simply have to diversify its knowledge base with additional audio corpora. In the future, the Microsoft researchers therefore expect to “improve the performance of the model in terms of prosody and style of expression”.

For the most curious, Microsoft has put a demo site online to test the capabilities of VALL-E.

In a note on the ethics of their tool, the engineers warn of possible misuse: “Since VALL-E can synthesize speech that maintains the identity of the speaker, it may carry potential risks of misuse of the model, such as spoofing voice identification or impersonating a specific speaker”, they anticipate. Should the model ever be made available publicly, “it should include a protocol to ensure the speaker approves the use of their voice”, warn the developers.


