GPT-4: a joint project by Google and OpenAI to improve the precision of responses

Image: alengo/Getty Images.

One of the biggest weaknesses of generative AI programs like ChatGPT is their knowledge cutoff. ChatGPT, for example, was long limited to knowledge from before September 2021, until OpenAI announced an update to GPT-4 extending its data to April 2023.

To improve these programs, artificial intelligence engineers are working on ways to allow them to reliably access constantly evolving data.

With that in mind, Google and OpenAI released a joint project this month, called “FreshLLM,” that incentivizes GPT-4 to use insights gleaned from Google searches. The core of FreshLLM is a new method of prompting a language model, called “FreshPrompt,” that includes search engine results.

How does it work?

FreshPrompt includes Google’s top search results in GPT-4’s input prompt, along with a demonstration of a valid answer to a query based on such results. GPT-4 is thereby incentivized to ground its output in evidence found on the web.
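Schematically, the technique can be sketched as follows. This is a minimal illustration, not the paper’s actual code: the field names and prompt layout are assumptions, and the real FreshPrompt format may differ.

```python
from dataclasses import dataclass

@dataclass
class Evidence:
    """One retrieved search result (field names are illustrative)."""
    source: str   # e.g. the page or site the snippet came from
    date: str     # publication date, useful for time-sensitive questions
    snippet: str  # the text extract shown in the search results

def build_fresh_prompt(question: str, evidences: list,
                       demonstration: str = "") -> str:
    """Assemble a FreshPrompt-style prompt: an optional few-shot
    demonstration, then the retrieved evidences, then the question."""
    parts = []
    if demonstration:
        # Shows the model an example of a valid, evidence-grounded answer.
        parts.append(demonstration)
    parts.append(f"query: {question}")
    for i, ev in enumerate(evidences, start=1):
        parts.append(f"evidence {i} | source: {ev.source} | date: {ev.date}")
        parts.append(f"  {ev.snippet}")
    parts.append(f"question: {question}")
    parts.append("answer:")
    return "\n".join(parts)

prompt = build_fresh_prompt(
    "What is Brad Pitt's last film as an actor?",
    [Evidence("example.com", "2023-01-01",
              "Brad Pitt's most recent release is ...")],
)
```

The prompt string would then be sent to the model as its input; the demonstration at the top is what nudges the model to base its answer on the listed evidences rather than on its frozen training data.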

“FreshPrompt significantly improves performance [of generative AI programs] compared to competing approaches that use search engines,” emphasize Google’s Tu Vu and his team.

FreshPrompt is only part of the story, however. To test how well GPT-4 and its competitors use data from the internet, Tu Vu and his team had to develop a list of questions covering both fixed facts and current events.

600 varied questions

To do this, the team, helped by external collaborators, drafted questions on “the evolution of the world”. The questions were first chosen to mobilize “fresh” knowledge – that is, requiring “knowledge that has changed recently or concerning new events.” They also had to be “plausible”: it had to be “plausible that a real person would type this question into their search engine”.

Some of the 600 questions created by Google and OpenAI researchers. Image: Google, OpenAI.

These 600 questions, grouped under the name “FreshQA”, range from “Has Virginia Woolf’s novel about the Ramsay family entered the public domain in the United States?”, which calls for a fixed answer, to “What is Brad Pitt’s last film as an actor?”, the answer to which can quickly change. Most, but not all, answers come from Wikipedia.

The project’s GitHub repository links to a Google Sheets document that brings together all of the FreshQA questions. To get an idea of the range of themes covered, you can browse the questions, which run from “Which author sold the greatest number of novels in the United States last year according to Publishers Weekly?” (the answer is Colleen Hoover) to “How many accounts have exceeded 100 million followers on Instagram?” (38).
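For readers who want to poke at the dataset, a CSV export of such a sheet can be loaded with a few lines of standard-library Python. The column names below are assumptions for illustration, not the sheet’s actual schema, and the two sample rows reuse questions and answers quoted in this article.

```python
import csv
import io

# A tiny stand-in for a CSV export of the questions sheet; the real
# FreshQA sheet's columns may differ (these names are assumptions).
sample_csv = """question,answer
"Which author sold the greatest number of novels in the United States last year according to Publishers Weekly?","Colleen Hoover"
"How many accounts have exceeded 100 million followers on Instagram?","38"
"""

def load_questions(csv_text: str) -> list:
    """Parse the CSV export into a list of dicts, one per question."""
    return list(csv.DictReader(io.StringIO(csv_text)))

rows = load_questions(sample_csv)
```

With a real export, you would replace `sample_csv` with the file’s contents and iterate over `rows` to feed each question to the model under test.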

To further challenge the AI, there are also misleading questions built on false premises. For example: “In what year did the first human being land on Mars?”

Significant improvements

The large language models (LLMs) tested, including GPT-4 and Google’s Pathways Language Model (PaLM), struggled with FreshQA questions, as might be expected. But with the help of FreshPrompt, the results improved significantly. Tu Vu and his team attribute the weak baseline mainly to the lack of updating of the LLMs’ information, which produces answers that are sometimes obsolete; furthermore, many models simply refuse to give an answer.

On GPT-4, the addition of FreshPrompt, the team says, “significantly improves the accuracy of answers to FreshQA questions”, particularly because this technique “significantly reduces hallucinations and outdated answers”. On questions relating to facts after 2022, the gap in results is enormous: accuracy jumps from 8% to 70.2%. On all FreshQA questions, which include older facts, the difference remains notable, going from 28.6% to 75.6%.

For misleading questions built on false premises, the difference is also stark: using FreshPrompt, GPT-4 went from 33.9% correct answers to 71%. That still leaves errors in almost a third of cases.

The Tu Vu team also found that FreshPrompt outperformed other approaches that use search engine queries to “augment” language models. These include, for example, Perplexity, which combines GPT-3.5 and Bing Search. Perplexity’s average accuracy across all FreshQA questions is 52.2% – barely better than chance – while GPT-4, using FreshPrompt, achieved an accuracy of 75.6%.

Among the important factors noted by the team is the number of pieces of evidence from the web search included in FreshPrompt. In general, the more items, the better the chance of a correct answer: “Our results suggest that the number of evidences retrieved for each question is the most important ingredient for achieving the highest accuracy.”

For the Tu Vu team, real challenges remain. Notably, constantly updating FreshPrompt means checking that answers are still correct, and that takes a lot of time. The team hopes that the open-source community can help, or that the updates can be automated by generative AI. For now, however, it is committed to keeping FreshQA fresh.
