Which is the best AI? The case of Grok 3 and the benchmark problem


02 April 2025

To determine which AI model is the best, it’s not enough to look at benchmarks. It’s essential to understand our needs and the cost we’re willing to bear to meet them.

Do Better Team

In mid-February, xAI, Elon Musk’s AI company, launched Grok 3 with great fanfare, presenting it as the best AI model to date. One of the charts released alongside the announcement appeared to support this claim.

The chart shows results on the AIME benchmark, which is built from the American Invitational Mathematics Examination, one of the qualifying exams for the US Mathematical Olympiad. On it, Grok 3 apparently outperformed every competitor. However, the xAI model was the only one on the chart drawn in two shades of color. The darker area is the score based on its first response, while the lighter area is the score obtained when the most common answer across 64 different attempts is the one that counts.

Only Grok 3 is shown with this dual evaluation. Why isn’t the same criterion applied uniformly to all models? In an article published in the Spanish outlet ON Economía, Esteve Almirall, professor at Esade and expert in AI and innovation, answered: “because that way Grok 3 looks better in the picture.” Under a standardized criterion, the ranking looks different.

“If we only considered the first response, the result would be completely different: we would no longer be looking at the supposed best math model in the world, but rather a competitive one that, despite being the most recent, doesn’t outperform current leaders,” Almirall added.
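The gap between the two shades can be made concrete with a small sketch. The Python below scores the same set of sampled answers in two ways: by the first response only, and by the most common answer across all attempts. The function names and the toy data are hypothetical and chosen purely for illustration; they are not xAI’s evaluation code or real AIME results, and they use four attempts per problem instead of 64 to keep the example short.

```python
from collections import Counter


def first_answer_accuracy(samples_per_problem, correct_answers):
    """Share of problems whose FIRST sampled answer is correct."""
    hits = sum(
        samples[0] == truth
        for samples, truth in zip(samples_per_problem, correct_answers)
    )
    return hits / len(correct_answers)


def majority_vote_accuracy(samples_per_problem, correct_answers):
    """Share of problems whose MOST COMMON sampled answer is correct."""
    hits = 0
    for samples, truth in zip(samples_per_problem, correct_answers):
        most_common_answer, _count = Counter(samples).most_common(1)[0]
        hits += most_common_answer == truth
    return hits / len(correct_answers)


# Hypothetical toy data: four problems, four attempts each.
samples = [
    ["42", "17", "42", "42"],      # first attempt right, majority right
    ["10", "25", "25", "25"],      # first attempt wrong, majority right
    ["7", "7", "3", "7"],          # first attempt wrong, majority wrong
    ["100", "100", "99", "100"],   # first attempt right, majority right
]
truth = ["42", "25", "3", "100"]

first_score = first_answer_accuracy(samples, truth)
majority_score = majority_vote_accuracy(samples, truth)

print(f"first answer:  {first_score:.2f}")
print(f"majority vote: {majority_score:.2f}")
```

On this toy data the same model scores 0.50 on first answers and 0.75 under majority voting. Neither number is wrong, but a fair comparison between models has to use the same one for all of them.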

The problem with benchmarks

The little trick xAI used to convince the public that its model was the best in the world reflects a widespread dynamic in the industry. In the AI sector, benchmarks are standardized tests used to assess and compare the performance of new models, especially in specific tasks such as natural language processing, computer vision, reasoning, and more.

These tests are becoming more sophisticated as models evolve, but that doesn’t always mean end users benefit from relevant improvements. “Few users participate in high-level math competitions or answer doctoral-level physics or biology exams. Most people want AI to translate accurately, respond correctly, and be easy to understand,” Almirall noted.


And while benchmarks serve an important purpose—they offer a synthetic way to understand a model’s behavior—they also have limitations.

“When the measurement tool itself becomes the goal, the outcome can be misleading,” the professor wrote. “More and more AI models are trained with benchmarks in mind, not users. This has led to the emergence of terms like ‘gaming the benchmarks’ or ‘cooking the benchmarks,’ which refer to practices where models are designed to maximize test scores instead of improving the actual user experience.”

This tendency to optimize for the metric is not exclusive to AI. In finance, the unorthodox practice of manipulating numbers to present a more favorable picture is known as ‘creative accounting.’ In some countries, inflated statistics used to meet political targets have led to the term ‘fake GDP.’ Another well-known example is the ‘cooking’ of electoral polls to benefit certain candidates.

What gets left out of the metrics?

Beyond their technical limitations, benchmarks also tend to overlook a dimension that’s becoming increasingly relevant in AI development: efficiency.

In the race to achieve top results on standardized metrics, many models are built under a “bigger is better” logic, aiming to boost performance by using more resources and computing power. But this approach raises questions about their sustainability—both environmental and economic.


The resource consumption required to train and run large language models (LLMs), such as Grok 3 or the models behind ChatGPT, has grown exponentially, while performance gains are increasingly marginal. And in any case, current evaluation systems don’t measure how much water, energy, or materials are needed to achieve a given score. For Professor Irene Unceta, academic director of the Bachelor’s Degree in Business and Artificial Intelligence at Esade, “the AI industry, like any other, should be subject to quality controls, impact criteria, and efficiency assessments.”

The best AI?

Nevertheless, there are some reasons for hope. The latest model from DeepSeek, developed in China under significant logistical constraints, signals the potential for a paradigm shift. Its emergence has shown that it is possible—and desirable—to achieve competitive results with far more efficient use of resources.

Answering the question of which is the best AI, therefore, is not just about looking at benchmarks. It means understanding what our needs are as individuals and as a society, and what cost we’re willing to accept to meet them.


All written content is licensed under a Creative Commons Attribution 4.0 International license.
