In recent months there has been a groundswell of anecdotal evidence and general murmurings concerning a decline in the quality of ChatGPT responses. A team of researchers from Stanford and UC Berkeley decided to determine whether there was indeed degradation and come up with metrics to quantity the scale of detrimental change. To cut a long story short, the dive in ChatGPT quality certainly wasn't imagined.
Three distinguished academics, Matei Zaharia, Lingjiao Chen, and James Zou, were behind the recently published research paper How Is ChatGPT's Behavior Changing Over Time? (PDF) Earlier today, Computer Science Professor at UC Berkeley, Zaharia, took to Twitter to share the findings. He startlingly highlighted that "GPT -4's success rate on 'is this number prime? think step by step' fell from 97.6% to 2.4% from March to June."
GPT-4 became generally available about two weeks ago and was championed by OpenAI as its most advanced and capable model. It was quickly released to paying API developers, claiming it could power a range of new innovative AI products. Therefore, it is sad and surprising that the new study finds it so wanting of quality responses in the face of some pretty straightforward queries.
We have already given an example of GPT-4's superlative failure rate in the above prime number queries. The research team designed tasks to measure the following qualitative aspects of ChatGPT's underlying large language models (LLMs) GPT-4 and GPT-3.5. Tasks fall into four categories, measuring a diverse range of AI skills while being relatively simple to evaluate for performance.
- Solving math problems
- Answering sensitive questions
- Code generation
- Visual reasoning
An overview of the performance of the Open AI LLMs is provided in the chart below. The researchers quantified GPT-4 and GPT-3.5 releases across their March 2023 and June 2023 releases.
It is clearly illustrated that the "same" LLM service answers queries quite differently over time. Significant differences are seen over this relatively short period. It remains unclear how these LLMs are updated and if changes to improve some aspects of their performance can negatively impact others. See how much 'worse' the newest version of GPT-4 is compared to the March version in three testing categories. It only enjoys a win of a small margin in visual reasoning.
Some may be unbothered about the variable quality observed in the 'same versions' of these LLMs. However, the researchers note, "Due to the popularity of ChatGPT, both GPT-4 and GPT-3.5 have been widely adopted by individual users and a number of businesses." Therefore, it isn't beyond the bounds of possibility that some GPT-generated information can affect your life.
The researchers have voiced their intent to continue to assess GPT versions in a longer study. Perhaps Open AI should monitor and publish its own regular quality checks for its paying customers. If it can't be clearer about this, it may be necessary for business or governmental organizations to keep an check on some basic quality metrics for these LLMs, which can have significant commercial and research impacts.
No, we haven't made GPT-4 dumber. Quite the opposite: we make each new version smarter than the previous one.Current hypothesis: When you use it more heavily, you start noticing issues you didn't see before.July 13, 2023
AI and LLM tech isn't a stranger to surprising issues, and with the industry's data pilfering claims and other PR quagmires, it currently seems to be the latest 'wild west' frontier on connected life and commerce.