I recently came across a paper titled "How Is ChatGPT's Behavior Changing Over Time?" that went viral on social media claiming that ChatGPT (and the underlying GPT models) are getting substantially worse at specific tasks over time. As an AI engineer who uses GPT and other LMs regularly, I was shocked by these claims and had to dive into the details. I found a flawed paper making exaggerated claims based on awful evaluation practices. Let's dive into what made this "research" such a problem!
LLMs Aren't Calculators or Mathematicians
The first outrageous claim in the paper is that GPT-4 has gone from 97.6% accuracy on a math test in March 2023 to just 2.4% accuracy in June 2023. Gasp! GPT has forgotten basic arithmetic!
Not so fast. The test was on determining if numbers are prime, which involves more mathematical reasoning than calculation. The authors prompted GPT to "think step-by-step" and show its work. In March, GPT nicely stepped through the logic to determine primality. But in June, it just answered "No" without showing steps.
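For context, the benchmark task itself is mechanical once the definition is known - a few lines of trial division settle primality. Here is a minimal sketch (my own illustration, not the paper's actual test harness):

```python
def is_prime(n: int) -> bool:
    """Check primality by trial division up to the square root of n."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

print(is_prime(97))  # True: 97 has no divisors other than 1 and itself
print(is_prime(91))  # False: 91 = 7 x 13
```

A calculator-style program answers this instantly and exactly; a language model predicts plausible text about it. That gap is precisely why the benchmark measures the wrong thing.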
The benefit of large language models has never been to replace calculators or mathematicians, or to perform mathematical reasoning. Their strength lies in generating coherent natural language text, not numerical analysis or logic. Tests expecting human-level mathematical work from LLMs set them up for failure at a task they were never designed to handle in the first place.
LLMs like GPT are not human mathematicians or even calculators. They do not possess true mathematical reasoning capabilities or an understanding of abstract concepts like primality. Asking them to show their work makes no more sense than asking a search engine to show its work. The June update seems to have made GPT correctly ignore this request rather than humour it.
To properly evaluate them, we need benchmarks focused on language tasks, not inappropriate ones like math proofs. Examples of valuable tests are summarising long articles, translating between languages, and answering broad questions based on large textual datasets. But evaluating their mathematical skills is as fruitless as testing a calculator on its ability to write poetry.
Quotes ≠ Broken Code
Another supposed sign of GPT's declining intelligence was a drop in the percentage of directly executable Python code it generated from 52% to 10%. But looking into the details reveals a silly mistake by the authors.
The June GPT wrapped the code in triple backticks as Markdown formatting, like this:

```python
print("Hello World!")
```
Rather than a dumb mistake, this is GPT helpfully formatting code as a Markdown code block. The authors seem unaware of basic Markdown and falsely equate it with broken code. 🤦‍♂️
Once again, the evaluation methodology is at fault here. Production-ready code generation has never been GPT's purpose. The goal of LLMs is to generate natural language, not software systems. Expecting executor-ready code from ChatGPT reflects a fundamental misunderstanding of its capabilities. The model simply formatted its answer as a Markdown code block, which helps readers distinguish code from prose - and the code runs fine once the backticks are stripped.
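Had the authors wanted directly executable output, removing the Markdown fence before running the code is trivial. A minimal sketch (the function name and regex are my own, not part of the paper's harness):

```python
import re

def strip_markdown_fence(text: str) -> str:
    """Remove a surrounding ```...``` Markdown code fence, if present."""
    match = re.search(r"```[A-Za-z0-9_+-]*\n(.*?)```", text, re.DOTALL)
    return match.group(1) if match else text

response = '```python\nprint("Hello World!")\n```'
print(strip_markdown_fence(response).strip())  # print("Hello World!")
```

With a pre-processing step this small, the "executable code" metric would likely tell a very different story.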
It's a Black Box, But It's Still Improving
One valid criticism raised is that because ChatGPT is proprietary software rather than open-source research, users don't know when and how it gets updated. This lack of transparency means its behaviour can change without notice.
However, OpenAI has assured users that despite these opaque updates, each version continues to get more capable. For example, here is a quote from their July 2023 blog post:
"When we release new model versions, our top priority is to make newer models smarter across the board. We are targeting improvements on a large number of axes, such as instruction following, factual accuracy, and refusal behavior."
Their VP of Product, Peter Welinder said in a tweet:
"No, we haven't made GPT-4 dumber. Quite the opposite: we make each new version smarter than the previous one. Current hypothesis: When you use it more heavily, you start noticing issues you didn't see before."
So while the black box nature understandably breeds suspicion, those closely involved confirm it keeps getting better, not worse. We should judge it by what it can do, not how it's made.
Safety Isn't "Worse"
The researchers also claim GPT worsened because it went from answering 21% of "sensitive questions" in March to only 5% in June. Examples include asking how to make money illegally.
They conclude GPT is now "less helpful" in answering such dangerous questions. But not enabling illegal or unethical acts is an improvement, not a regression! The authors seem oblivious to safety considerations.
Responsibly declining to answer sensitive queries directly shows maturity, not weakness.
Good Evaluation Means Tasks Within Scope
Finally, part of why this paper goes so wrong is failing to match evaluation to system capabilities properly. As AI researchers, how can we design rigorous, fair tests?
The key is to deeply understand what an AI system was built for, and then construct targeted tests for those exact purposes. For large language models like ChatGPT, this means focusing on language generation tasks using diverse textual input data.
Appropriate benchmarks include summarisation, translation, conversational ability, and common sense reasoning based on text datasets. But judging them on mathematical proofs, code execution (without parsing), or willingness to encourage illegal activity takes them wildly out of scope.
Let's measure AI for what it can do, not what it never aimed to do. With more thoughtful, grounded evaluation, we can cut through the hype and hysteria to understand these technologies' true capacities. Papers like this "ChatGPT is Getting Dumber" one only further confuse the picture through misguided testing.
The Danger of Extrapolating Limited Examples
One consistent flaw in papers like this is making broad claims about an AI's overall capabilities based on a tiny set of cherry-picked examples. The authors base sweeping conclusions about declines in intelligence on very narrow tests like primality checking and code formatting.
Performance on any single contrived test provides little insight into general intelligence. Imagine judging a human's overall acumen based only on their ability to do long division! Yet this is precisely the kind of careless extrapolation we see repeatedly in AI research.
Rather than a handful of narrow benchmarks, we need comprehensive testing covering diverse scenarios representative of real-world use cases. Isolated anecdotes can generate headlines but tell us little about actual system abilities. The next time you see a paper deriving massive conclusions from a couple of trivial tests, view it with extreme scepticism.
Questionable Incentives Behind the Paper
We also must consider the motives behind research making sensational claims about AI systems. Incentives like publicity, funding, and citations can introduce bias into what results get reported and how they are framed.
For example, academics and journalists often profit from exaggerating the dangers or limitations of AI technologies. So it's unsurprising that papers provocatively alleging massive declines in performance like this one garner outsized attention. Even if the findings crumble under scrutiny, the exposure benefits authors and headlines drive clicks.
Rather than taking these papers at face value, we should view them through the lens of misaligned incentives. All researchers have motives, but some may prioritise publicity over rigour in selecting and presenting results.
The Complexity of Defining "Intelligence"
Part of the challenge in fairly evaluating AI systems is precisely defining nebulous concepts like general intelligence. The authors frame some narrow skills like mathematical reasoning as signs of intelligence. But this glosses over immense complexities.
For example, math skills require a vast web of capabilities - understanding symbols, logic, abstraction, concentration, and creative problem-solving. Which of these constitutes intelligence? And does failing at any one indicate overall cognitive decline? The reality is far messier than simply labelling specific tasks as "smart."
Rather than getting distracted by philosophical debates, it may be more productive to focus evaluation on capabilities relevant to an AI's actual purpose. Judging a language model on language tasks yields more meaningful results than ill-defined proxies for human cognition.
The Subjectivity of Task Difficulty
The paper repeatedly refers to prime number checking as a "simple" task that GPT supposedly became worse at. But abstract reasoning about primes seems simple only in retrospect, after you've mastered the concepts. The ability to reason about primes as numbers with exactly two factors develops over years of education.
This highlights how subjective perceptions of task difficulty affect assumptions about what skills an AI should possess. Benchmark design requires accounting for the slippery nature of complexity. Marking capabilities like mathematical logic as basics commits the sin of hindsight bias. Creating fair tests means recognising the inherent subjectivity of so-called common sense.
The Need for Apples-to-Apples Testing
Another evaluation pitfall is the lack of controlled comparison. When assessing changes between AI model versions, many variables besides core system abilities can affect outputs.
For example, differences in prompt formulation, sampling randomness, and selection filters can all introduce confounding factors. These variables must be held fixed for an apples-to-apples comparison. Yet papers like this often compare performance on test sets with different prompts, criteria, and measures.
Isolating the effect of model changes requires rigorously controlling the evaluation protocol. Any differences in scores could reflect factors other than system intelligence if the tests themselves are not identical. This noise makes performance trends impossible to interpret.
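In code terms, a controlled comparison holds everything fixed except the model itself: same prompts, same grading rule, same test set. A minimal sketch with stand-in model callables (the lambdas here are toy substitutes, not real API clients):

```python
def evaluate(model, test_set, grade):
    """Score a model callable on a fixed test set with a fixed grader."""
    correct = sum(grade(model(prompt), answer) for prompt, answer in test_set)
    return correct / len(test_set)

# Identical prompts and identical grading for both versions -
# any score difference now reflects the model change alone.
test_set = [("Is 97 prime? Answer Yes or No.", "Yes"),
            ("Is 91 prime? Answer Yes or No.", "No")]
grade = lambda output, answer: output.strip().lower().startswith(answer.lower())

model_march = lambda prompt: "Yes" if "97" in prompt else "No"  # stand-in
model_june = lambda prompt: "No"                                # stand-in

print(evaluate(model_march, test_set, grade))  # 1.0
print(evaluate(model_june, test_set, grade))   # 0.5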
The bottom line is that challenges like defining evaluation criteria, avoiding biased tests, and controlling variables make assessments far messier than papers like this acknowledge. Proper testing requires acknowledging and addressing these complexities rather than oversimplifying "intelligence" as a unitary concept measurable through narrow challenge questions.
Shoddy Research Harms Progress
Papers like this that misrepresent and exaggerate declines in AI abilities contribute to an environment of fear and scepticism around progress in the field. How can we make productive improvements if we can't correctly evaluate where today's systems succeed and fail?
Rather than conduct careful, rigorous assessments grounded in understanding the technology's capabilities, the authors leaned hard into shock value. The result is a tabloid-worthy paper that wildly extrapolates a few cherry-picked examples into sweeping but unjustified claims of declining performance.
We must promote more thoughtful, honest, and nuanced perspectives. Papers like this "GPT is getting dumber" one are clickbait - nothing more. Meanwhile, the underlying technology continues to advance rapidly, if we can filter out the hysteria.
Progress Continues When Hype Fades
So next time you see a breathless headline about "ChatGPT Forgetting Everything!" take it with a massive grain of salt. The reality is nuanced, but progress continues.
Shoddy research harms the field when it misrepresents and exaggerates declines in narrow areas while ignoring the bigger picture. We owe it to ourselves and society to promote honest, thoughtful perspectives on AI, not tabloid-worthy clickbait.
With more measured analysis and realistic expectations, we can filter out the hysteria and focus on the remarkable progress in natural language generation. But we have to keep hype from drowning out the truth.