Common Corpus: a corpus of copyright-free texts to feed LLMs


Training large language models or generative AI models cannot be done without large corpora of text or images. This is notably the subject of the agreement recently concluded between Le Monde and OpenAI, as well as of the dispute between OpenAI and the New York Times: access to content that can be used to train models.

It is within this context that the Common Corpus project, led by Pierre Carl Langlais and his startup Pleias, takes shape. Coming from the research world and specializing in information and communication sciences, he had already grappled with the subject of royalty-free documents in the past, for example through his work on Numapresse, a project digitizing French newspapers from the 19th century in order to analyze their contents. For the researcher, it is "impossible to separate artificial intelligence from the data used to train it. And this data is culture."

Published on Wednesday on the Hugging Face platform, this text corpus brings together a volume of 500 billion words across several languages, all guaranteed to be free of copyright. "Originally, we had already published a first, entirely French corpus a few months ago, which included around 80 billion words. We realized that there was strong interest in the project, so we wanted to move on to something bigger," Pierre Carl Langlais told ZDNET.

The right data for a good model

The final result is not limited to a single language: it contains around 200 billion words in English, 100 billion in French, 30 billion in German, around twenty billion in Dutch, and more in other languages.

The corpus is mainly made up of old texts, selected by the project's initiators to limit the risk of copyright infringement as much as possible: "It's a lot of work, but we can already rely on the work carried out by several digital libraries in this area. We also used data indexed by projects like the Internet Archive, while avoiding texts published after 1884, for example, so as not to use texts subject to copyright," explains Pierre Carl Langlais. The end result consists essentially of long texts, often in PDF form, which makes it an ideal tool for training an LLM on tasks such as document analysis, but also on the production of long texts, a task with which many current language models still struggle.

The objective of this corpus is to become a commons, a freely shared resource meant to "enable the emergence of alternative actors" who will be able to rely on Common Corpus to train their own language models. "We can clearly see today that the secret of a good model depends a great deal on the data used to train it. There are currently many debates around access to corpora, and the main players use closed corpora to train their models, without us really knowing what they contain," summarizes Pierre Carl Langlais.

Upstream work

The development of this common corpus aims not only to avoid the legal problems of reusing copyrighted texts, but also to give better control over how models are produced. "We see, for example, many models trained on data published on the web, which are therefore exposed to hateful or pornographic content. The main market players thus find themselves forced to carry out a posteriori control on the generated texts, and that doesn't seem like a good method to me. If we can't trace the origin of the data used to train the model, it's even harder to control what the models will generate."

Since the project aims to become a commons, Pierre Carl Langlais naturally invites other organizations and individuals interested in the approach to contribute to enriching the corpus. The researcher was able to count on the support of the state start-up LANGU:IA, as well as on a helping hand from Scaleway for hosting the project.

And the researcher intends to lay the groundwork for broader collaboration with other French and foreign actors: "I am already working with organizations like Hugging Face, Eleuther, Occiglot or Nomic AI. For the moment nothing has been formalized; we simply share similar values. But anyone can join us and help us identify new bodies of copyright-free texts to feed the project," explains Pierre Carl Langlais.
