HHEM | Flash Update: Fast. But Are They Furious?
GPT-4o and Gemini 1.5 Flash are fast and cheap, but they hallucinate more
≈ 3 minutes read

What a week it’s been. And it’s only Wednesday.
On Monday, OpenAI launched its new GPT-4o model, which is faster and cheaper (half the price and twice the speed of GPT-4 Turbo) and also “omni-modal,” with a very compelling demo showing just how fast it is and what being omni-modal means in practice.
On Tuesday, Google announced (among many other cool announcements) the general availability of Gemini 1.5 Flash. This model is likewise faster and cheaper, while supporting the same 1M-token long context window we saw in the full Gemini 1.5 Pro model.
Our team was able to quickly evaluate both models for their tendency to hallucinate using the Hughes Hallucination Evaluation Model (HHEM).
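For readers who want to run a similar check themselves, here is a minimal sketch of how HHEM scoring works, assuming the open HHEM checkpoint on Hugging Face (`vectara/hallucination_evaluation_model`) and the sentence-transformers CrossEncoder interface. The source/summary pairs and the 0.5 threshold below are illustrative; the leaderboard pipeline also includes summary generation for a large document set and aggregation across models.

```python
# Rough sketch of scoring (source, summary) pairs with the open HHEM checkpoint.
# Assumes the sentence-transformers CrossEncoder interface; the example pairs
# are illustrative, not taken from the leaderboard benchmark set.
from sentence_transformers import CrossEncoder

# Each pair is (source document, model-generated summary).
pairs = [
    ("The plane landed safely in Denver after a two-hour delay.",
     "The flight arrived in Denver following a two-hour delay."),
    ("The plane landed safely in Denver after a two-hour delay.",
     "The flight was cancelled due to a snowstorm in Denver."),
]

hhem = CrossEncoder("vectara/hallucination_evaluation_model")
scores = hhem.predict(pairs)  # ~1.0 = factually consistent, ~0.0 = hallucinated

# Count a summary as hallucinated when its score falls below 0.5
# (a threshold assumed here for illustration).
hallucination_rate = sum(s < 0.5 for s in scores) / len(scores)
print(f"Hallucination rate: {hallucination_rate:.1%}")
```

HHEM returns a factual-consistency score between 0 and 1 for each pair; a model’s leaderboard number comes from aggregating such scores over many summarized documents.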
The results?
Well, the models are certainly fast. But they are not as furious.
As our updated leaderboard shows, both new models perform worse than their predecessors. GPT-4 Turbo sported an impressive hallucination rate of 2.5%, while GPT-4o worsened to 3.7%. Gemini 1.5 Pro shows a hallucination rate of 4.6%, whereas Gemini 1.5 Flash worsened to 5.3%.
Reflecting on this, it actually makes sense that models optimized for speed and cost would lose some of their capabilities. Of course, we’d rather that weren’t the case, but alas, that’s a common engineering trade-off, and it seems to apply here too.
And finally, the news that Ilya Sutskever is leaving OpenAI just dropped. As many have said, he is one of the most amazing innovators in the field, and I think I can safely say we are all extremely grateful for his contributions, which helped bring us all into this age of AI.