Can Meta’s artificial intelligence lip read?


It is well known that people understand speech not only by listening with their ears, but also by picking up cues from the lip movements of the person speaking. Likewise, combining visual observation with audio could help a computer analyze human speech more accurately. Computer programs can already lip read, in a sense, but building such systems is laborious.

Recent work from Meta, the parent company of Facebook, Instagram, and WhatsApp, suggests a more efficient way to someday get computers to lip read.

Last Friday, artificial intelligence (AI) researchers at Meta released a report in which they succeeded in drastically reducing the effort required to design software that can recognize words from the movements of speakers’ lips in recorded videos. The work also shows that lip-reading technology can be used to significantly improve speech recognition in noisy environments.

The program is “75% more accurate than the best audio-visual speech recognition systems (which use both sound and images of the speaker to understand what they are saying),” the authors say.

The road to self-supervision

Of course, we are thinking here of the metaverse. Not only could the program be used for instant translation, but it could also, one day, “help generate realistic lip movements in virtual reality avatars, in order to provide a real sense of presence – that feeling of being there with someone, even if they are on the other side of the world”.

This work represents progress on two fronts. The first is self-supervised learning, which dispenses with explicit cues, such as text transcriptions, and lets the program discover the structure of the data on its own. The other axis of development is multimodal neural networks, which combine data of different kinds so that each modality reinforces the other.

The result, called AV-HuBERT (“AV” for audiovisual, “Hu” for “hidden unit”), combines auditory and visual signals to detect words from the movements of the lips. Lead author Bowen Shi and colleagues Wei-Ning Hsu, Kushal Lakhotia and Abdelrahman Mohamed, all at Facebook, explain their work in the article titled “Learning Audio-Visual Speech Representation By Masked Multimodal Cluster Prediction”. The authors also wrote a blog post that is arguably more digestible.

As the researchers explain, previous work was also multimodal, combining visual data (video images) with audio data (waveform snippets) in order to train a neural network to predict how they match. But those programs tended to rely on additional pre-prepared cues, such as transcriptions of the speakers’ videos into sentences of text that then served as labels. The new work takes the road of self-supervision, letting the model assemble structure spontaneously without external annotation.

“This is the first system to jointly model speech and lip movements from unlabeled data – raw videos that have yet to be transcribed,” the authors write in their blog post.

Merged approach

The AV-HuBERT program they invented builds on an audio-only program called HuBERT, introduced last year by Wei-Ning Hsu and his colleagues. As its name suggests, HuBERT uses the bidirectional Transformer neural network approach developed at Google in 2018.

By “masking” parts of an audio recording, that is, by leaving out sections of an audio waveform, the HuBERT neural network, in its training phase, had to reconstruct the pieces of audio that go together. Now, in AV-HuBERT, the researchers “merge” pieces of audio with images from videos of people talking. The training of the neural network essentially takes place in two stages. First, as in the original HuBERT, they use the attention approach to mask the audio, and then group the audio waveforms into clusters, i.e. groups of examples that are close to one another in their attributes.
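To make that first stage concrete, here is a minimal sketch, assuming the librosa and scikit-learn libraries, of how HuBERT-style cluster targets can be built: frame-level audio features are grouped with k-means so that each frame receives a discrete “hidden unit” label that later serves as a prediction target. This is an illustration of the idea, not Meta’s code.

```python
# Illustrative sketch (not Meta's code): build discrete cluster targets
# from frame-level audio features, HuBERT-style.
import numpy as np
import librosa                      # assumed available, for MFCC features
from sklearn.cluster import KMeans  # assumed available, for clustering

def make_cluster_targets(wav_paths, n_clusters=100, sr=16000):
    """Return one array of frame-level cluster IDs per utterance."""
    feats = []
    for path in wav_paths:
        audio, _ = librosa.load(path, sr=sr)
        # MFCC features, shape (frames, 39)
        feats.append(librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=39).T)
    kmeans = KMeans(n_clusters=n_clusters, n_init=10)
    kmeans.fit(np.concatenate(feats, axis=0))   # cluster all frames together
    return [kmeans.predict(f) for f in feats]   # pseudo-labels per utterance
```

These pseudo-labels play the role of the “hidden units” in the name: they are not words or phonemes, just recurring acoustic patterns that the model can then be asked to predict.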

These groupings then become a target for the second stage of the neural network. The multimodal part of AV-HuBERT simultaneously masks the images of the speakers’ lips and the audio waveform, then attempts to match them to the clusters established in the first stage. In this way, the program works out which lip configurations correspond to which audio waveforms, thereby “learning” the correlation between mouth movements and audio output.
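The second stage can be pictured roughly as in the sketch below, which is illustrative only and not the actual AV-HuBERT architecture (the dimensions, the additive fusion and the layer sizes are simplifications introduced here): masked spans of lip-frame and audio features are replaced by a learned mask embedding, fused, passed through a Transformer encoder, and trained to predict the cluster ID of each masked frame.

```python
# Illustrative sketch of masked multimodal cluster prediction (not the
# real AV-HuBERT model): fuse partially masked audio and lip features,
# encode them with a Transformer, and predict cluster IDs for the
# masked frames only.
import torch
import torch.nn as nn

class TinyAVMaskedPredictor(nn.Module):
    def __init__(self, audio_dim=39, video_dim=512, d_model=256, n_clusters=100):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.video_proj = nn.Linear(video_dim, d_model)
        self.mask_embed = nn.Parameter(torch.zeros(d_model))  # stands in for hidden frames
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, n_clusters)

    def forward(self, audio, video, audio_mask, video_mask):
        # audio: (B, T, audio_dim), video: (B, T, video_dim)
        # *_mask: (B, T) booleans, True where a frame is hidden from the model
        a = self.audio_proj(audio)
        v = self.video_proj(video)
        a = torch.where(audio_mask.unsqueeze(-1), self.mask_embed.expand_as(a), a)
        v = torch.where(video_mask.unsqueeze(-1), self.mask_embed.expand_as(v), v)
        fused = a + v                            # simple additive fusion, for brevity
        return self.head(self.encoder(fused))    # (B, T, n_clusters) logits

def masked_cluster_loss(logits, cluster_targets, masked_positions):
    # Cross-entropy computed only on the masked frames, as in masked prediction.
    return nn.functional.cross_entropy(logits[masked_positions],
                                       cluster_targets[masked_positions])
```

Because both streams are masked and feed the same encoder, the model is pushed to use whichever modality is available to recover the hidden cluster labels, which is what makes the two modalities mutually reinforcing.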

It is, in fact, a self-supervised approach that discovers structure without explicit cues. Merging means that the attention paid to the images and the attention paid to the audio waveforms reinforce each other, producing better clusters than either modality could produce on its own. These clusters then become the “target” for downstream tasks such as lip reading and speech recognition.

As the authors explain, “AV-HuBERT simultaneously captures linguistic and phonetic information for unmasked regions from both the lip movement and audio streams into its latent representations, then encodes their long-range temporal relationships to resolve the masked prediction task”.

Lip Reading Sentences 3

Once AV-HuBERT has been pre-trained in this self-supervised manner, the authors fine-tune it by introducing actual labeled video, hours of it, with formal transcriptions that tell the machine which words are spoken in the video.
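In code, that fine-tuning stage might look roughly like the sketch below, a hedged illustration rather than the authors’ implementation: the pretrained encoder is reused as a feature extractor (here assumed to map fused features of shape (batch, time, d_model) to contextual features of the same shape), and a small output head is trained on the transcriptions. A CTC head is shown as one common choice for speech models; the paper’s exact decoding setup may differ.

```python
# Hedged sketch of fine-tuning on labeled video (not the authors' code).
import torch
import torch.nn as nn

class LipReadingFinetuner(nn.Module):
    def __init__(self, pretrained_encoder: nn.Module, d_model=256, vocab_size=1000):
        super().__init__()
        self.encoder = pretrained_encoder              # pretrained, frozen or lightly tuned
        self.out = nn.Linear(d_model, vocab_size + 1)  # +1 for the CTC blank symbol

    def forward(self, fused_features):
        # fused_features: (B, T, d_model) audio-visual features
        hidden = self.encoder(fused_features)          # contextual features, (B, T, d_model)
        return self.out(hidden).log_softmax(-1)        # per-frame log-probabilities

# Illustrative training step: CTC loss expects (T, B, C) log-probabilities.
# log_probs = model(fused_features)
# loss = nn.CTCLoss(blank=vocab_size)(log_probs.transpose(0, 1),
#                                     targets, input_lengths, target_lengths)
```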

The primary dataset used to test and train the AV-HuBERT program is LRS3, developed in 2018 by Triantafyllos Afouras and colleagues at Oxford, which is “the largest publicly available sentence-level lip reading dataset to date. It consists of over 400 hours of video, taken from English-language TED and TEDx talks on YouTube”.

Thanks to AV-HuBERT’s self-supervised training, the program can predict words from videos of speakers more effectively than any previous attempt, the researchers write. But more important than the raw score is the drastic reduction in the amount of labeled data needed to train it. “AV-HuBERT achieves the state of the art using 433 hours of text transcriptions, two orders of magnitude less than the 31,000 hours of labeled data used in the previous best approach,” they write.

With much less data required, it becomes possible to perform lip reading in languages that have far fewer resources than others, so-called low-resource languages. (Think of languages other than English, French and German, for example.) The authors point out that “in future work, AV-HuBERT can be applied to multilingual lip reading in low-resource languages” and that the same “approach can be extended to other applications of visual speech representation, such as speech enhancement and generation”.

What about ambient noise?

The researchers supplemented their results with a second paper, also published last week, describing the use of AV-HuBERT for automatic speech recognition. Here the focus is on how to improve the parsing of speech in a noisy environment.

Speech recognition “deployed in meeting scenarios is prone to conversation noise, while speech recognition used in a home environment naturally encounters music, cooking or vacuuming noises.” Their question is whether AV-HuBERT can overcome this ambient noise.

The researchers mixed noise clips in with AV-HuBERT’s sample video images and audio waveforms during training. The result, they write, is that the program becomes good at working around the noise. So much so that AV-HuBERT reduces the word error rate, that is, the proportion of wrongly recognized words, by 50% compared to previous speech recognition systems. “Our future work includes the application of audiovisual speech recognition in real, resource-limited and multilingual contexts,” they write.
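The augmentation itself is conceptually simple. The sketch below is an illustration of the idea, not the authors’ code (their actual noise types and signal-to-noise-ratio schedule are not reproduced here): a noise clip is mixed into the clean waveform at a chosen signal-to-noise ratio, so that during training the model has to lean on the visual stream when the audio is corrupted.

```python
# Illustrative noise mixing at a target signal-to-noise ratio (SNR).
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Add `noise` to `clean` (1-D float arrays) at `snr_db` decibels of SNR."""
    # Loop or trim the noise clip so its length matches the clean signal.
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[: len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12     # avoid division by zero
    # Scale the noise so that 10 * log10(clean_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise
```

At low SNRs the audio becomes nearly useless, which is exactly the regime where the lip images pay off.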

The idea that AI is now better than humans at lip reading has come up in earlier AI work in recent years. AV-HuBERT’s best word error rate, 26.9%, is in fact much better than that of professional human lip readers, whose best reported result is around 40% (they get roughly four words out of ten wrong). Obviously, for tasks such as transcribing conferences after the fact, this could give software an advantage.
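For reference, word error rate is simply the word-level edit distance between the system’s output and the reference transcript (substitutions, deletions and insertions), divided by the number of reference words. A small, self-contained illustration:

```python
# Illustrative word error rate (WER) computation via word-level edit distance.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                               # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                               # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# A 26.9% WER therefore means roughly 27 word errors per 100 reference words.
```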

In practice, however, there is a big caveat: this is, in effect, simulated lip reading. AV-HuBERT’s results come from tests on recorded video, not from a real live conversation.

Source: ZDNet.com




