On the Importance of Benchmarking in AI

April 18, 2025 - Written for Kite AI

“Tell me how you measure me and I will tell you how I behave” - Eli Goldratt

If there is one takeaway from the generative AI era, it may be that model naming is a harder problem than model engineering. From Geminis to DeepSeeks, from chats to thinkings, and from previews to minis to experimentals, AI engineers must be quite glad that Cantor’s diagonal argument holds for the numbers between 1 and 3. It is therefore no surprise that the people responsible for implementing these models constantly complain about how difficult it is to keep track of the best-performing ones.

Recently, I came across this awesome spreadsheet, created by Harlan Lewis, which covers, in detail, the performance of many of the top generative AI models and how that performance has shifted over time. I appreciate resources like this because they are simple to understand and provide a great starting point for comparing the benchmarked performance of these models.

But, as I was reading through this spreadsheet, I couldn’t help but wonder about the broader implications of model benchmarking as we move towards agentic solutions, and about the barriers we are running into along the way.

Put simply, a benchmark is a standard point of reference against which things can be compared. In technical terms, researchers share a dataset with a defined metric, and various models are evaluated on that dataset according to this metric. The fundamental assumption is that higher metric performance means better models.

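To make this concrete, here is a minimal sketch, in Python, of what a benchmark boils down to: a shared dataset, a single fixed metric, and a ranking of models by their score on that metric. The dataset format, metric, and function names here are illustrative placeholders rather than any real benchmark.

```python
# A minimal sketch of a benchmark: a fixed dataset, a fixed metric, and a
# ranking of models by score. All names and structures are illustrative.

def accuracy(predictions, labels):
    """Fraction of examples predicted correctly (the benchmark's metric)."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

def run_benchmark(models, dataset):
    """Score every model on the same examples with the same metric."""
    inputs = [example["input"] for example in dataset]
    labels = [example["label"] for example in dataset]
    scores = {}
    for name, model in models.items():
        predictions = [model(x) for x in inputs]
        scores[name] = accuracy(predictions, labels)
    # Rank highest-first: the assumption that a higher score means a better
    # model lives entirely in this sort.
    return dict(sorted(scores.items(), key=lambda item: item[1], reverse=True))
```

Whichever model sorts to the top of that final ranking is the one a leaderboard would call “best.”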
And benchmarking is important because benchmarks tell us two things:

  1. Which models are best for downstream applications, and
  2. The direction along which to improve those models (i.e., toward higher performance)

The latter point has been the crux of machine learning, or more generally, predictive modeling, as argued by David Donoho in his article titled “50 Years of Data Science”. This is because of something called the Common Task Framework, which Donoho outlines as follows (a small code sketch of this setup appears after the list):

  1. A publicly available training dataset involving, for each observation, a list of (possibly many) feature measurements, and a class label for that observation.
  2. A set of enrolled competitors whose common task is to infer a class prediction rule from the training data.
  3. A scoring referee, to which competitors can submit their prediction rule. The referee runs the prediction rule against a testing dataset, which is sequestered behind a Chinese wall. The referee objectively and automatically reports the score (prediction accuracy) achieved by the submitted rule.

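To see how little machinery the framework actually requires, here is a hedged sketch of a CTF-style scoring referee in Python (the class and method names are my own placeholders, not Donoho’s): competitors only ever see the public training split, while the referee alone holds the sequestered test split and reports a single score per submission.

```python
# A sketch of a Common Task Framework referee. All names are placeholders.
# Competitors receive only the public training data; the test data stays
# sequestered with the referee, which reports one objective score per entry.

class Referee:
    def __init__(self, train_data, test_data, metric):
        self.public_train = train_data        # shared with every competitor
        self._sequestered_test = test_data    # never leaves the referee
        self.metric = metric                  # e.g., prediction accuracy
        self.leaderboard = {}

    def submit(self, competitor, prediction_rule):
        """Run a submitted prediction rule against the hidden test set."""
        inputs = [x for x, _ in self._sequestered_test]
        labels = [y for _, y in self._sequestered_test]
        predictions = [prediction_rule(x) for x in inputs]
        score = self.metric(predictions, labels)
        self.leaderboard[competitor] = score
        return score
```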
As you can see, the Common Task Framework is a formalization of the benchmark discussed above, with a few added aspects to prevent cheating. Donoho then goes on to state what is, in my opinion, a crucial observation: “those fields where machine learning has scored successes are essentially those fields where CTF [the Common Task Framework] has been applied systematically.”

And systematically has it been applied - from quintessential examples like Kaggle and Papers with Code to specific tasks like language translation, voice recognition, and image classification, it is not hard to see how influential the Common Task Framework has been over the last decade. And intuitively, the framework seems to have worked well for problems that fit it - ones that are unambiguous, have a clear metric (or metrics) of success, and have a well-defined scope of data.

But what happens when problems cannot be modeled by the Common Task Framework?

Many believe that one such problem is artificial general intelligence. As argued by Raji et al. in the paper titled “AI and the Everything in The Whole Wide World Benchmark”, the issue is that, by definition, benchmarks restrict the scope of what can be claimed, which directly conflicts with the “general” goal of artificial general intelligence. Put simply, a model performing well on a benchmark strictly means that the model performed well on that benchmark, and we should be cautious about making claims about its generality or applicability in settings outside of it.

To see why this might be the case, consider any subjective task you do in your day-to-day life and think about what it would mean to design a Common Task Framework benchmark for it. What metric do you select? How do you measure it? What is the scope of the data? As soon as you define the benchmark, I’m sure you could think of cases that fall outside the dataset’s scope or that the metric fails to capture. And this doesn’t even begin to address the more recent challenge of benchmarking in a world where the training data might already include the benchmark itself.

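As a rough illustration of that last challenge, one crude but common heuristic is to flag a benchmark item if it shares a long n-gram with any document in the training corpus. The sketch below assumes naive whitespace tokenization and a 13-token window; real decontamination pipelines are considerably more involved, so this is only meant to show the shape of the problem.

```python
# A crude contamination check (illustrative only): flag a benchmark item if it
# shares a sufficiently long n-gram with any document in the training corpus.

def ngrams(text, n=13):
    """All length-n token windows in the text, using naive whitespace splitting."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(benchmark_item, training_corpus, n=13):
    """True if the item overlaps any training document on at least one n-gram."""
    item_ngrams = ngrams(benchmark_item, n)
    return any(item_ngrams & ngrams(document, n) for document in training_corpus)
```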
What does all of this mean? Well, either

  1. The Common Task Framework is sufficient for artificial general intelligence, but we haven’t yet constructed the right benchmark for it, or
  2. The Common Task Framework is insufficient for artificial general intelligence, and we will need some paradigm shift to achieve it (assuming Donoho’s argument holds).

I don’t think that anyone knows the answer to this yet, and how this plays out remains to be seen. But in the short term, one thing is clear - data access is critical as long as the Common Task Framework remains the dominant approach.