Databases: why Databricks’ acquisition of the AI startup MosaicML foreshadows a gigantic battle



Naveen Rao, co-founder and CEO of MosaicML, and Hanlin Tang, co-founder and CTO. The company’s training technologies are applied to building “experts”: large language models (LLMs) used to process data. (Image: MosaicML)

On Monday, Databricks, a 10-year-old software company based in San Francisco, announced that it would acquire MosaicML, a three-year-old startup, for $1.3 billion.

This move reflects not only the fervor around generative artificial intelligence, but also the changing nature of the cloud database market.

MosaicML, whose staff includes many semiconductor veterans, has developed a program called Composer. Composer makes it easy and affordable to take a standard artificial intelligence program, such as OpenAI’s GPT, and dramatically speed up its development by optimizing how the neural network is trained.
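Composer is distributed as an open-source Python package. The snippet below is a minimal sketch of the kind of training run it enables, based on the library’s public examples; the exact class names and arguments may differ across versions, and the dataset here is synthetic purely to keep the example self-contained.

```python
# Minimal sketch of speeding up training with MosaicML's Composer.
# Assumes `pip install mosaicml torch torchvision`; APIs may vary by version.
import torch
import torchvision
from torch.utils.data import DataLoader, TensorDataset
from composer import Trainer
from composer.algorithms import BlurPool, LabelSmoothing
from composer.models import ComposerClassifier

# Tiny synthetic image-classification dataset, only to make the sketch runnable.
images = torch.randn(256, 3, 32, 32)
labels = torch.randint(0, 10, (256,))
train_loader = DataLoader(TensorDataset(images, labels), batch_size=32)

# Wrap a standard torchvision model so Composer can train it.
model = ComposerClassifier(torchvision.models.resnet18(num_classes=10), num_classes=10)

# The speed-up methods are passed as composable "algorithms" that modify training.
trainer = Trainer(
    model=model,
    train_dataloader=train_loader,
    max_duration="2ep",  # two epochs
    algorithms=[BlurPool(), LabelSmoothing(smoothing=0.1)],
)
trainer.fit()
```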

“Neural networks can be considered as a database”

This year, the company launched commercial cloud-based services, through which companies can, for a fee, train a neural network and perform inference, that is, predictions in response to queries from users.

However, the most far-reaching implication of MosaicML’s approach is that the traditional relational database could be completely reinvented.

“Neural network models can actually be thought of as a kind of database, especially when it comes to generative models,” said Naveen Rao, co-founder and CEO of MosaicML, in an interview with ZDNET before signing the agreement.

“Schema is discovered from data”

“At a very high level, a database is a collection of points that are usually very structured, typically rows and columns of some sort of data, and then, based on that data, there’s a schema that you organize it with,” Rao explained.

Unlike a traditional relational database, such as Oracle’s, or a document database, such as MongoDB’s, where the schema is built in advance, Rao said, with a large language model “the schema is discovered from the data; it produces a latent representation based on the data”. And querying is also flexible, unlike the fixed queries of a language like SQL.
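To make the contrast concrete, here is a small illustrative sketch, not MosaicML code: a relational store only answers the questions its fixed schema anticipates, while an embedding-based “latent representation” supports looser, similarity-style lookups. The embed() function is a hypothetical stand-in for a real text-embedding model.

```python
# Illustrative contrast between a fixed-schema query and a "latent" lookup.
# The embed() function is a hypothetical placeholder for a real embedding model.
import sqlite3
import hashlib
import math

# 1) Relational database: the schema (columns) is declared up front,
#    and queries can only ask about those columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, author TEXT, year INTEGER)")
conn.execute("INSERT INTO books VALUES ('Dune', 'Frank Herbert', 1965)")
rows = conn.execute("SELECT title FROM books WHERE year < 1970").fetchall()

# 2) Latent representation: each text is mapped to a vector, and "queries"
#    become nearest-neighbour lookups in that vector space.
def embed(text: str, dim: int = 64) -> list[float]:
    """Hypothetical stand-in: hash words into a fixed-size vector."""
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[int(hashlib.md5(word.encode()).hexdigest(), 16) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

docs = ["a desert planet and a spice empire", "a detective in a rainy city"]
doc_vecs = [embed(d) for d in docs]
query_vec = embed("politics of a desert world")
scores = [sum(q * d for q, d in zip(query_vec, dv)) for dv in doc_vecs]
print(rows, docs[scores.index(max(scores))])
```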

Using prompts to make queries

“In effect,” adds Rao, “you take a database and you relax the constraints on its inputs, its schema and its outputs.” In the form of a large language model, such a database can handle large quantities of data that elude traditional structured data stores.

“I can ingest a whole series of books by an author, and I can interrogate the ideas and the relationships within those books, which you can’t do with just text,” Rao said.

By using prompts with an LLM, it is possible to query this database in flexible ways. “When you query it the right way, you get something from the context created by the prompt,” says Rao. “So it’s possible to query aspects of the original data from that context, which is a pretty broad concept that can apply to many things. And I think that’s actually why these technologies are very important.”
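A sketch of what Rao is describing: relevant passages are packed into the prompt as context, and the question is answered out of that context rather than out of a fixed schema. The ask_llm() function below is a hypothetical placeholder for whatever model endpoint is actually used.

```python
# Sketch of prompt-based querying: the "database" is queried by building a
# prompt that carries the relevant context. ask_llm() is a hypothetical
# stand-in for a call to an actual large language model endpoint.

def build_prompt(question: str, passages: list[str]) -> str:
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

def ask_llm(prompt: str) -> str:
    # Hypothetical placeholder: a real implementation would call a model here.
    return "<model response>"

passages = [
    "In the author's first novel, the hero distrusts centralized power.",
    "The later essays return to the theme of power and institutions.",
]
prompt = build_prompt("How does the author's view of power evolve?", passages)
print(ask_llm(prompt))
```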

The link between Big Data and artificial intelligence

MosaicML’s work is part of a broader movement to make generative AI programs, like ChatGPT, more relevant for business purposes.

For example, Snorkel, a three-year-old startup also based in San Francisco, offers tools that let companies write functions that automatically create labeled training data for AI models, the largest of which are neural networks such as OpenAI’s GPT-4.

Another startup, OctoML, last week unveiled a service to make inference work easier.

The acquisition by Databricks brings MosaicML into the non-relational database market, which for several years has been pushing the data storage paradigm beyond rows and columns.

This includes the Hadoop data lake, the techniques for exploiting it, and the “map and reduce” paradigm of Apache Spark, of which Databricks is the main promoter. The market also includes data streaming technologies, where storage can, in a sense, live within the data stream itself. Known as “data in motion”, this approach underpins the Apache Kafka software promoted by Confluent.
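For readers unfamiliar with the paradigm, the classic word count below shows what “map and reduce” looks like in practice; it is a minimal sketch in PySpark, assuming a local Spark installation, and is not tied to Databricks’ own platform.

```python
# Minimal "map and reduce" sketch in PySpark (assumes `pip install pyspark`).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()
lines = spark.sparkContext.parallelize(["data in motion", "data at rest", "data lake"])

counts = (
    lines.flatMap(lambda line: line.split())  # map each line to its words
         .map(lambda word: (word, 1))         # map each word to a (word, 1) pair
         .reduceByKey(lambda a, b: a + b)     # reduce: sum the counts per word
)
print(counts.collect())
spark.stop()
```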

Smaller, more efficient models: the beginnings of a Moore’s Law for AI

MosaicML, which raised $64 million before the transaction, targets companies whose language models are not general-purpose in the vein of ChatGPT, but built for domain-specific use cases. This is what Rao calls building “experts”.

The dominant trend in artificial intelligence, including generative AI, has been to build increasingly general programs capable of handling tasks in everything from video games to online chat to writing poems.

The excitement around ChatGPT shows how compelling such a general program can be when it can handle an unlimited variety of requests. Yet the use of AI by individuals and organizations will likely be dominated for a long time to come by much more targeted approaches, because they can be much more effective.

“I can build a smaller model for a particular domain that greatly outperforms a larger model,” Rao told ZDNET.

MosaicML has made a name for itself by demonstrating its prowess in MLPerf benchmark tests that show how quickly a neural network can be trained. One of the secrets to accelerating AI is the observation that smaller neural networks, built with more care, can be more efficient.

This idea was explored in depth in a 2019 paper by MIT scientists Jonathan Frankle and Michael Carbin, which won Best Paper that year at the International Conference on Learning Representations. The paper introduced the “lottery ticket hypothesis,” the notion that every large neural network contains “subnets” that can be just as accurate as the total network, but with less computational effort.
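The core of the idea can be sketched in a few lines of PyTorch: train a network, keep only its largest weights, rewind those survivors to their original initial values, and retrain the resulting subnetwork. What follows is a simplified illustration of the hypothesis, not Frankle and Carbin’s exact iterative procedure.

```python
# Simplified sketch of the "lottery ticket" idea: prune small weights, rewind
# the surviving weights to their initial values, and retrain the sparse subnet.
# Illustrative only; not the exact procedure from Frankle & Carbin (2019).
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
init_state = copy.deepcopy(model.state_dict())  # remember the original init
x, y = torch.randn(512, 20), torch.randint(0, 2, (512,))

def train(net, masks=None, steps=200):
    opt = torch.optim.SGD(net.parameters(), lr=0.1)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(net(x), y)
        loss.backward()
        opt.step()
        if masks:  # keep pruned weights at exactly zero throughout training
            with torch.no_grad():
                for name, param in net.named_parameters():
                    if name in masks:
                        param.mul_(masks[name])
    return loss.item()

dense_loss = train(model)

# Keep only the largest 20% of weights in each linear layer.
masks = {}
for name, param in model.named_parameters():
    if "weight" in name:
        k = int(0.8 * param.numel())
        threshold = param.abs().flatten().kthvalue(k).values
        masks[name] = (param.abs() > threshold).float()

# Rewind the surviving weights to their original initialization and retrain.
model.load_state_dict(init_state)
sparse_loss = train(model, masks)
print(f"dense loss: {dense_loss:.3f}  sparse 'winning ticket' loss: {sparse_loss:.3f}")
```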

Frankle and Carbin served as advisors to MosaicML.

An optimal balance between the amount of training data and the size of a neural network

MosaicML also explicitly draws on techniques explored by Google’s DeepMind, which show that there is an optimal balance between the amount of training data and the size of a neural network. By doubling the amount of training data, it is possible to make a smaller network much more accurate than a larger network of the same type.
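DeepMind’s scaling analysis is often summarized with two rules of thumb: training compute is roughly 6 × N × D floating-point operations for a model with N parameters trained on D tokens, and the compute-optimal point sits at roughly 20 tokens per parameter. Treating these as approximations rather than exact figures, the trade-off can be sketched as follows.

```python
# Back-of-the-envelope sketch of the data/model-size balance. The constants
# (compute ~ 6*N*D FLOPs, ~20 tokens per parameter at the compute-optimal
# point) are commonly cited approximations, not exact figures.

def training_flops(params: float, tokens: float) -> float:
    return 6.0 * params * tokens  # rough rule of thumb for training compute

def compute_optimal_tokens(params: float) -> float:
    return 20.0 * params  # roughly 20 training tokens per parameter

big_model = 70e9   # a 70-billion-parameter model
small_model = 7e9  # a 7-billion-parameter model

budget = training_flops(big_model, compute_optimal_tokens(big_model))
# With the same compute budget, the smaller model can see far more data:
small_tokens = budget / (6.0 * small_model)
print(f"70B model, optimal data: {compute_optimal_tokens(big_model):.2e} tokens")
print(f"7B model, same compute:  {small_tokens:.2e} tokens "
      f"({small_tokens / compute_optimal_tokens(small_model):.0f}x its 'optimal' amount)")
```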

All of these efficiencies are summed up by Rao in what he calls a kind of Moore’s Law of neural network acceleration. Moore’s Law is the empirical rule of semiconductors which held that the number of transistors on a chip would double roughly every 18 months at constant cost. It was this economic miracle that made possible the PC revolution, and then the smartphone revolution.

In Rao’s version, neural networks can become four times faster with each generation simply by applying computational tricks through MosaicML’s Composer tool.
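A quick back-of-the-envelope comparison of the two growth rates; treating one “generation” as one year is an assumption made here for illustration, not something Rao specifies.

```python
# Tiny arithmetic sketch comparing the two growth rates. Treating one
# "generation" as one year is an assumption for illustration only.
years = 3
moore_speedup = 2 ** (years * 12 / 18)  # transistor count doubles every 18 months
rao_speedup = 4 ** years                # 4x faster per generation of training tricks
print(f"after {years} years: Moore's Law ~{moore_speedup:.0f}x, Rao's version ~{rao_speedup}x")
```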

The market for training AI models on business data

Such an approach suggests several surprising lessons. First, contrary to the belief that machine-learning forms of AI require massive amounts of data, smaller datasets may perform very well when data and model size are kept in the optimal balance identified by DeepMind’s work. In other words, really big data might not be the best.

Second, unlike gigantic generic neural networks such as GPT-3, which is trained on everything on the Internet, smaller networks can become the repository of a company’s unique knowledge of its domain.

“Our infrastructure almost becomes the background for building these kinds of networks on people’s data,” Rao explained. “And that’s why people have to build their own models.”

“If you’re a bank or an intelligence agency, you can’t use GPT-3 because it was trained on Reddit; it was trained on a bunch of data that might even contain personally identifiable information, and it could contain data that has not been explicitly authorized for use,” Rao said.

Open source LLMs

It is for this reason that MosaicML has pushed to make open-source versions of large language models available, so that customers know what kind of program is acting on their data. It is a view shared by other generative AI leaders, such as Stability.ai founder and CEO Emad Mostaque, who told ZDNET in May that “it’s impossible to use black-box models” for the world’s most valuable data, including corporate data.

Last Thursday, MosaicML released, as open source, the latest version of its language model, MPT-30B, which contains 30 billion parameters, or neural weights. The company claims that MPT-30B surpasses the quality of OpenAI’s GPT-3. Since MosaicML began releasing its language models as open source in early May, they have been downloaded more than two million times.
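The released weights are published on the Hugging Face Hub under the mosaicml organization. Below is a minimal sketch of loading the model with the transformers library, assuming the checkpoint remains available as “mosaicml/mpt-30b” and that the considerable memory a 30-billion-parameter model requires is available.

```python
# Minimal sketch of loading MosaicML's open-source MPT-30B with Hugging Face
# transformers. Assumes the checkpoint remains published as "mosaicml/mpt-30b"
# and that enough GPU/CPU memory is available for a 30B-parameter model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "mosaicml/mpt-30b"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name,
    torch_dtype=torch.bfloat16,  # half-precision weights to reduce memory
    trust_remote_code=True,      # MPT models ship custom modeling code
)

inputs = tokenizer("Databricks acquired MosaicML because", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```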


