Machine Learning: In search of an all-rounder algorithm


With data2vec, a research team at Meta/Facebook has announced an AI model that is intended to handle visual tasks in addition to written and spoken language. The framework reportedly gets by with a single algorithm and a uniform learning mechanism for the different input modalities. Those interested can view the code and examples on GitHub. According to the announcement, the name is a nod to word2vec, a text-specific neural network developed at Google in 2013 for predicting clusters of words with related meanings. The new model is based on a standard Transformer that the research team led by Alexei Baevski pre-trained on image data, speech audio and text.

Google is working in a similar direction with DeepMind's Perceiver, a multimodal version of the Transformer, and the German company Aleph Alpha is creating multimodal AI models such as luminous, which can process and combine different types of data such as text and image input. luminous was announced at the end of 2021 at the International Supercomputing Conference, as Heise reported. In the case of data2vec, according to its research paper, the MetaAI team started out by pre-training a Vision Transformer (ViT), an architecture originally designed specifically for visual tasks. Without further modification, the same neural network should now also be able to handle speech recognition and NLP (Natural Language Processing).

The prediction method comes from self-supervised learning and works by gradually hiding parts of the training input (masked prediction). Over several training phases, a model learns to construct representations of the input data via probabilities. In further steps, parts of the input are hidden (masked) and the system is thus gradually nudged toward filling in the blanks in a more or less plausible way (see Fig. 1). The team uses two neural networks: a teacher that processes the complete input and a student that has to fill in the hidden areas.

Schematic representation of how the training of the data2vec framework takes place in a teacher-student mode, from the MetaAI research paper on self-supervised pre-training across modalities (Fig. 1).

(Image: MetaAI Research)
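
To make the teacher-student scheme more tangible, the following PyTorch sketch illustrates masked-prediction training in representation space. It is a minimal sketch, not the actual data2vec implementation: the tiny Transformer encoder, the masking ratio, the EMA decay value and the regression against the teacher's final-layer outputs (the paper works with averaged representations from several teacher layers) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEncoder(nn.Module):
    """Stand-in for the shared Transformer encoder (illustrative only)."""
    def __init__(self, dim=64, depth=2, heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        # x: (batch, seq_len, dim) -- already embedded/patchified input
        return self.encoder(x)

def mask_input(x, mask_token, mask_ratio=0.15):
    """Randomly hide a fraction of the input positions (masked prediction)."""
    batch, seq_len, _ = x.shape
    mask = torch.rand(batch, seq_len, device=x.device) < mask_ratio  # True = hidden
    x_masked = x.clone()
    x_masked[mask] = mask_token  # replace hidden positions with a learned token
    return x_masked, mask

# Two networks with identical architecture: the teacher sees the complete
# input, the student only the masked version and must predict the teacher's
# representations at the hidden positions.
dim = 64
student = TinyEncoder(dim)
teacher = TinyEncoder(dim)
teacher.load_state_dict(student.state_dict())
for p in teacher.parameters():
    p.requires_grad_(False)  # the teacher is not trained by gradient descent

mask_token = nn.Parameter(torch.zeros(dim))
optimizer = torch.optim.AdamW(list(student.parameters()) + [mask_token], lr=1e-4)

def training_step(x, ema_decay=0.999):
    # 1. Teacher builds target representations from the unmasked input.
    with torch.no_grad():
        targets = teacher(x)

    # 2. Student sees only the masked input and predicts representations.
    x_masked, mask = mask_input(x, mask_token)
    preds = student(x_masked)

    # 3. Loss only on the hidden positions (regression in representation space).
    loss = F.smooth_l1_loss(preds[mask], targets[mask])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # 4. The teacher slowly tracks the student as an exponential moving
    #    average of its weights (decay value here is an assumption).
    with torch.no_grad():
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(ema_decay).add_(ps, alpha=1 - ema_decay)
    return loss.item()

# Dummy batch standing in for embedded image patches, audio frames or tokens.
loss = training_step(torch.randn(8, 32, dim))
```

The point the article makes is precisely that this training loop stays the same regardless of whether the input tensor holds embedded image patches, audio frames or text tokens; only the modality-specific input preparation differs.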

What is currently going on in AI research can best be described as a kind of new space race: in the USA, China and also in Europe, ever larger AI models with many billions of parameters are being created at ever shorter intervals and trained on unlabeled datasets to deliver contextual output. In the future, it should be possible to work with images, text and spoken language in combination without needing different programs. Machines could thus come closer to “understanding” and “perceiving” the world, since, according to the research teams involved, their ability to learn is increasingly approaching that of humans and they develop contextual knowledge beyond their initial training, in the long term through independent observation of the world. This opens up space for numerous new applications and business areas, for example in the direction of augmented reality (AR).

In the past, models were special-purpose machines trained for clearly definable use cases such as recognizing pedestrians in traffic, language assistants, machine translation or applications intended purely for text processing; development is now moving beyond that at breakneck speed. According to insiders, the future belongs to multimodality, i.e. processing different types of data and media in a single model. The deep neural networks required for this are increasingly trained by self-supervised or unsupervised learning.

The way there is lined with smaller and larger milestones, and US hyperscalers in particular are spending large sums to drive development in the area. Mark Zuckerberg’s announcement of a Metaverse and the renaming of Facebook to Meta drew some mockery on the internet, since the playful video-game look of his marketing video makes the disruptive potential of AI development for society scarcely tangible. What is at stake becomes more tangible in current research papers on the latest AI models that are just being launched.

Anyone who would like to know more about data2vec will find what they are looking for in the blog entry by Alexei Baevski’s Meta research team or can view the recently published research paper. The data2vec models and code can be found on GitHub. There is material about the Perceiver on the DeepMind blog. Information on ongoing multimodal AI research in Europe can be found in a Heise article on the launch of OpenGPT-X, and the research on multimodal augmentation of generative models by adapter-based fine-tuning (MAGMA), which underlies the AI models used by Aleph Alpha, is now also available on arXiv.org.


(her)
