OpenAI's New Models Hallucinate More Than Ever Before

By OpenAI's own testing, its latest reasoning models, o3 and o4-mini, hallucinate substantially more than o1.
First reported by TechCrunch, OpenAI's system card detailed the PersonQA evaluation results, which are designed to measure hallucinations. Those findings put o3's hallucination rate at 33 percent and o4-mini's at 48 percent, or nearly half the time. By contrast, o1's hallucination rate is 16 percent, meaning o3 hallucinated roughly twice as often.
The system card indicated that o3 “generally produces a higher number of claims, which includes both more accurate statements and an increased amount of incorrect or fabricated ones.” However, OpenAI remains uncertain about the root cause, merely stating, “Further investigation is required to comprehend why this occurs.”
OpenAI's reasoning models are advertised as being more accurate than their non-reasoning counterparts, such as GPT-4o and GPT-4.5. This enhanced accuracy stems from using more computational resources, allowing them to "dedicate more time for reflection prior to generating a response," as described in the o1 announcement. Rather than largely relying on stochastic methods to provide an answer, the o-series models are trained to "refine their thinking process, try different strategies, and recognize their mistakes."
However, the system card for GPT-4.5, released in February, showed a 19 percent hallucination rate on the PersonQA evaluation. That document also compares it with GPT-4o, which exhibited a 30 percent hallucination rate.
Benchmarks themselves can be tricky to evaluate. They can be subjective, especially when created in-house, and research has found flaws both in their datasets and in how they evaluate models.
Plus, some rely on different benchmarks and methods to test accuracy and hallucinations. HuggingFace's hallucination benchmark evaluates models on the "occurrence of hallucinations in generated summaries" from around 1,000 public documents and found much lower hallucination rates across the board for major models on the market than OpenAI's evaluations. GPT-4o scored 1.5 percent, GPT-4.5 preview 1.2 percent, and o3-mini-high with reasoning scored 0.8 percent. It's worth noting o3 and o4-mini weren't included in the current leaderboard.
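For a sense of how a summary-based hallucination score like the one above might be computed, here is a minimal sketch: a summary counts as hallucinated when it fails a factual-consistency check against its source document, and the rate is the share of flagged summaries. This is illustrative only; the leaderboard uses a trained consistency model, and the word-overlap heuristic and 0.5 threshold below are stand-in assumptions.

```python
# Illustrative sketch of a summary-level hallucination rate (not the
# leaderboard's actual code). consistency_score() is a crude stand-in for a
# trained factual-consistency model, and the 0.5 threshold is an assumption.

def consistency_score(source_doc: str, summary: str) -> float:
    """Crude stand-in: fraction of summary words that also appear in the source."""
    source_words = set(source_doc.lower().split())
    summary_words = summary.lower().split()
    if not summary_words:
        return 1.0
    supported = sum(1 for word in summary_words if word in source_words)
    return supported / len(summary_words)

def hallucination_rate(docs, summaries, threshold=0.5):
    """Fraction of summaries judged inconsistent with their source documents."""
    flagged = sum(
        1 for doc, summary in zip(docs, summaries)
        if consistency_score(doc, summary) < threshold
    )
    return flagged / len(summaries)

# If 15 of 1,000 summaries were flagged, the rate would be 1.5 percent,
# the scale of the GPT-4o number reported on that leaderboard.
print(hallucination_rate(
    ["The cat sat on the mat."],
    ["A dog flew to the moon."],
))  # 1.0 -- the toy heuristic flags this summary as unsupported
```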
That's all to say, even with industry-standard benchmarks, hallucination rates are difficult to assess.
Additionally, models generally become more accurate when they can search the web to source their responses. However, in order to use ChatGPT with search, OpenAI shares data with third-party search providers, and businesses using OpenAI models internally may prefer not to share their prompts with those services.
Nevertheless, OpenAI's own admission that its latest o3 and o4-mini models hallucinate more than its non-reasoning models is a potential problem for users. We have contacted OpenAI and will update this article with its response.