OpenAI is always looking for new data to train its language models. With that in mind, Sam Altman’s firm reportedly turned to YouTube and drew on it extensively.
If artificial intelligence systems like ChatGPT seem so exceptional to us, it is because they have spent several years ingesting enormous quantities of data, which is what lets them generate so much content today, often of good quality. The problem is that the supply of usable data is finite. Companies in the sector must therefore get creative to find new sources. That appears to be what OpenAI did by turning to YouTube.
OpenAI turned to YouTube
The New York Times has been in open conflict with OpenAI for many months. So when the famous American newspaper finds potentially embarrassing information about the firm headed by Sam Altman, it does not hesitate to publish it. That is what it did in recent days, reporting that OpenAI collected nearly 1 million hours of YouTube videos to develop its GPT-4 language model.
To do this, the Californian company reportedly used its Whisper tool, which transcribes audio and video into text, to convert the content into written form that GPT-4 could then ingest. According to the other major American newspaper, the Wall Street Journal, the giants working on AI are currently short of quality data to improve their systems.
For Google, companies cannot train on data from YouTube
The New York Times reports that OpenAI had exhausted the quality data available for its AI as early as 2021. At that time, discussions had reportedly already begun about turning to alternative sources such as videos, audiobooks or podcasts. That is ultimately what happened, opening the door to YouTube.
Contacted by The Verge, Google, the parent company of YouTube, said it had seen “unconfirmed reports” of OpenAI activity on its platform. Spokesman Matt Bryant also pointed out that “our robots.txt files and terms of service prohibit scraping or unauthorized downloading of content from YouTube.” Is a new legal front about to open for OpenAI?
Source: Engadget