Wikipedia offers insight into the copyright headaches of generative AI


The arrival of generative artificial intelligences, such as ChatGPT for text and Midjourney for images, raises serious legal questions, particularly around copyright. The Wikimedia Foundation, which maintains the Wikipedia encyclopedia, outlines the complex legal ramifications of the topic.

ChatGPT does not generate text by magic. Its language model, which forms its technical foundation (GPT-3.5, then GPT-4), has been trained on hundreds of billions of words. It is also a chatbot that can fetch information from the web, including from Wikipedia, as can be seen, for example, in the presentation of GPT-4.

This use of the encyclopedia's content, written collectively by Internet users, raises a legal question: can OpenAI, the American company behind ChatGPT, freely draw on the site's pages to train its models and, at the user's request, browse them in order to generate text?

The question echoes one that is emerging more and more around AIs specialized in illustration, such as DALL-E, Stable Diffusion and Midjourney, the three best-known generative artificial intelligences. Their algorithms have been trained on existing content so that they can respond to a request with their own interpretation.

More and more artists are denouncing the arrival of these tools. First, because the tools compete with them head-on, which threatens their livelihood. Second, because these models may have been trained on copyrighted drawings, paintings and visuals without any authorization.

For Wikimedia, use of its content is not a concern… at first glance

What about a collective work like Wikipedia? The issue is being addressed by the Wikimedia Foundation, which oversees the encyclopedia and related projects. On March 23, it published an article offering a first legal analysis of copyright in relation to ChatGPT, through the lens of American law.

Although Wikimedia warns that this is a “preliminary point of view”, and therefore likely to evolve, a direction emerges. At first sight, the use of content hosted on its sites by OpenAI and ChatGPT (and by anyone more generally, be it a chatbot, a company or an individual) would not pose a problem.

We already see it on the web: Google also draws on Wikipedia, as do other companies. The encyclopedia is a valuable resource for enriching entire sections of a search engine and for helping voice assistants find information and read it aloud. These uses are massive.

Wikimedia is beginning to explore the legal issues of generative AI. // Source: Idil Keysan

This comes down to the legal framework that applies to the content. On Wikipedia, the texts, images, sounds, videos and other formats are, for the vast majority, governed by a Creative Commons license. Specifically, it’s the Attribution-ShareAlike 3.0 license, which is one of the most permissive.

“Creative Commons licenses allow free reproduction and reuse, so AI programs like ChatGPT can copy text from a Wikipedia article or an image from Wikimedia Commons”, advises Wikimedia in its preliminary observation. Anyone can therefore retrieve the text and reuse it, without paying anything.

However, there is one point on which Wikimedia is less certain: can the massive copying of content lead to a violation of the Creative Commons license if certain requirements of this framework are not respected? In principle, the license requires attribution and sharing under the same conditions. These two conditions do not seem to be rigorously applied.

Attribution generally involves citing the author and providing a link to the source. As for sharing under the same conditions, the idea is that the new content uses the same license. In ChatGPT, we do not see these elements when interacting with the chatbot. In other integrations, such as Bing, the sourcing is better.

In addition to the question of the nature of the “input” data (is it protected? Can its use be considered fair use? Etc.), there is also the question of what happens at the “output” (is it covered by copyright? If so, who holds the rights? Is it subject to the same license as the input data? Etc.).

The issue of compliance with the license at the output level, that is to say in the text generated by the AI, is also delicate for another reason. ChatGPT does not just copy and paste Wikipedia: in its response, it can rewrite parts of it while relying on other sources, producing a blended result. As a result, the part coming from Wikipedia is more or less diluted.

“Overall, it’s more likely than not that training systems using copyrighted data are covered by fair use in the United States, if current precedents are to be believed, but the uncertainty is great”, warns Wikimedia. This assumption only applies to the USA. In France, there is no fair use, only exceptions.

The foundation admits that the legal issues around generative AIs, which are trained on data whose status varies, have yet to be determined and clarified, including the volume of data involved in feeding the algorithms. This can also be seen on a related subject: can the creations of an AI be protected by copyright? For now, the answer is no.

These issues, and others sketched out by Wikimedia, are still far from being settled, especially since legislation differs from one country to another. “All possibilities remain open as key AI and copyright cases remain unresolved”, warns the foundation. A headache for lawyers and a red line for artists.

