An ethical AI that competes with ChatGPT without infringing copyright? This French start-up believes it is possible!


Vincent Mannessier

March 22, 2024 at 3:12 p.m.


Finally an ethical AI on the market?  © Anton Gvozdikov / Shutterstock

They knew it was impossible, and they did it anyway. While the giants of the AI sector have always justified their aggressive data collection as a necessity, the French start-up Pleias shows that it is possible to train an LLM without infringing copyright.

On Wednesday, March 20, researchers released what they believe to be the largest LLM trained only on data and content in the public domain. The model, called Common Corpus, is said to be roughly equivalent to GPT-3, and was developed in collaboration with several European laboratories and with the support of the French Ministry of Culture.

According to OpenAI, however, the task was impossible

Whenever the company or its executives are asked about the issue, OpenAI is adamant: the only way to develop ever more capable artificial intelligence is to stop bothering with dated concepts like copyright. This is also the company's official line of defense in the already numerous lawsuits filed against it on the subject. In its view, since artificial intelligence will benefit humanity, training models on content for which it does not hold the rights is a legitimate use that should not be contested. This position is shared by other giants in the sector, with Google in the lead.

However, developing "ethical" AI, at least in this respect, is not impossible, and several companies have been working on it for years. One of them is Pleias, which developed Common Corpus, a model using only data available in the public domain, and published it on Hugging Face, an open source AI platform.

It is the first model of its kind certified by the American organization Fairly Trained, which certifies ethically trained models. Although Pleias initiated the project and coordinates it, the start-up designed it in collaboration with other European organizations and with funding from the Ministry of Culture.

Sam Altman, CEO of OpenAI © TechCrunch

A corpus equivalent to that of GPT-3

With a training set of 500 billion tokens, Common Corpus is still far behind the latest state-of-the-art models: that is roughly what was used for GPT-3. The effort must be commended, even if such a vision is also a selling point, but this kind of development process has its limits.

The first of these comes from the public domain itself. The law varies from country to country, but in France, a work enters the public domain 70 years after the death of its author. The training data is therefore largely very dated, and such a model probably cannot be connected directly to the Internet in the same way as its more advanced counterparts, which are not bound by these considerations.

It is of course possible to add texts and other works with the agreement of their authors, but for the moment that process is much more complex and laborious than doing without their consent.

Sources: Wired, LePtiDigital
