Thoughts on adversarial evals

Over the past year, there's been a string of LLM evals (SimpleQA1, HLE, etc.) that have been adversarially filtered: questions are specifically chosen because LLMs perform poorly on them.

In theory, there's probably some merit to this. Existing evals have become increasingly saturated, and building adversarial evals is a relatively easy and surefire way of ending up with a set of tasks that are "hard" for existing language models2.

I think there are some issues with this approach though. Broadly, I'd argue that evals serve two purposes:

  • Evaluating LLM (or rather any general system) performance relative to other systems (i.e., most evals)
  • Evaluating system performance on an absolute basis to understand progress (i.e., ARC-AGI)

In both cases, the distribution of tasks in an eval is paramount3. Relative comparisons between two systems should be well-calibrated, and absolute performance should meaningfully reflect the distribution of things we care about.

A 15% performance delta between System X and System Y on an eval set should reflect a genuine improvement in the general case (for this kind of task), not just superiority on these specific questions.

Likewise, we also care about the question, "How good is this system at this kind of task?" — that is, does eval performance generalize to the real world?

Unfortunately, adversarial evals generally aren't very good at either.

For relative comparisons, the idea is theoretically sound — find questions that systems struggle with today and measure improvement over time. In practice, this involves generating a bank of tasks, testing them on a handful of state-of-the-art models, and discarding those that models handle well.
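In rough pseudocode, the filtering step looks something like the sketch below. The `adversarially_filter` function, the question/model shapes, and the 25% pass-rate threshold are all hypothetical, not any specific benchmark's pipeline:

```python
from typing import Callable, Dict, List

# Hypothetical shapes: a question is a dict with "prompt" and "gold" keys, and a
# model is any callable mapping a prompt string to an answer string.
Question = Dict[str, str]
Model = Callable[[str], str]

def adversarially_filter(questions: List[Question],
                         reference_models: List[Model],
                         max_pass_rate: float = 0.25) -> List[Question]:
    """Keep only the questions that most of the reference models answer incorrectly."""
    kept = []
    for q in questions:
        n_correct = sum(model(q["prompt"]) == q["gold"] for model in reference_models)
        if n_correct / len(reference_models) <= max_pass_rate:
            kept.append(q)  # "hard enough" for today's models, so it stays in the eval
    return kept
```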

Because questions are explicitly adversarial to a set of models, they end up capturing two things:

  • Questions where our models generally perform poorly today
  • Questions where the specific chosen models struggle

In general, there's probably some correlation between these two subsets: intuitively, most frontier models are trained on similar data with broadly similar training regimes, so there is value in understanding relative performance on these evals. Even so, when we overfit evals to a specific set of models, we reduce the diversity and value of those evals and introduce poor calibration.
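A toy simulation makes the calibration problem concrete. In the sketch below (all numbers are synthetic and purely illustrative, not drawn from any real benchmark), two models have the same average capability but different per-question strengths; filtering questions against one of them manufactures a large apparent gap:

```python
import random

random.seed(0)

N = 10_000
# Hypothetical per-question success probabilities. Models A and B have the same
# average capability but independent per-question strengths and weaknesses.
p_a = [random.random() for _ in range(N)]
p_b = [random.random() for _ in range(N)]

def mean_acc(probs, idx):
    idx = list(idx)
    return sum(probs[i] for i in idx) / len(idx)

everything = range(N)
# Adversarial filter: keep only the questions that reference model A does badly on.
hard_for_a = [i for i in everything if p_a[i] < 0.3]

print("full set:     A=%.3f  B=%.3f" % (mean_acc(p_a, everything), mean_acc(p_b, everything)))
print("filtered set: A=%.3f  B=%.3f" % (mean_acc(p_a, hard_for_a), mean_acc(p_b, hard_for_a)))
# On the full set both models sit around 0.50; on the filtered set A drops to
# roughly 0.15 while B stays near 0.50, an illusory gap created by the filter.
```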

One question you could ask is: are test-time compute models (R1, o1, etc.) genuinely x% better than previous-generation language models at PhD-level math? Or do they simply have different weaknesses compared to non-RL-trained models? Both are probably true to some degree, but many of our evals today probably don't capture this.

By their nature, then, adversarially generated evals don't accurately capture the distribution of useful, real-world tasks. There are some cases where the distribution of useful work follows a power law (many fields of research come to mind4). There are many other cases where the distribution of useful work is closer to uniform or normally distributed, and it might make more sense to understand how generally useful a system is at a macro level.
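One way to see why this matters: the same per-category accuracies yield very different headline scores depending on the task mix used to weight them. The categories and weights below are entirely made up, but they illustrate how an adversarially skewed mix can diverge from a plausible real-world usage mix:

```python
# Hypothetical per-category accuracies for a single model.
per_category_accuracy = {
    "olympiad_math": 0.30,
    "coding": 0.70,
    "instruction_following": 0.85,
    "summarization": 0.90,
}

# An adversarially skewed eval mix vs. a hypothetical real-world usage mix.
eval_mix = {"olympiad_math": 0.70, "coding": 0.20,
            "instruction_following": 0.05, "summarization": 0.05}
usage_mix = {"olympiad_math": 0.02, "coding": 0.28,
             "instruction_following": 0.40, "summarization": 0.30}

def weighted_score(accuracy, mix):
    """Aggregate score as a weighted average of per-category accuracy."""
    return sum(accuracy[k] * mix[k] for k in accuracy)

print("score under eval mix:  %.3f" % weighted_score(per_category_accuracy, eval_mix))   # ~0.44
print("score under usage mix: %.3f" % weighted_score(per_category_accuracy, usage_mix))  # ~0.81
```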

If we look at the past ~3 years of LLM research and progress, basically all of it has been either explicitly (i.e., specifically looking for training data distributions that improve performance on a subset of questions7) or implicitly guided by evals. Much work in ML, agentic workflows, and language modeling is highly iterative: progress is measured through benchmark performance, and evals shape how resources get allocated, where researchers spend their time, and broader research directions.

There's a fair case to be made that using a diverse set of (adversarial and non-adversarial) evals mitigates these issues. That probably helps, but it doesn't fix the lack of useful distributions, and given how many have turned to adversarially filtering questions to generate harder evals, I suspect we risk some degree of eval-driven model collapse.

When we optimize systems for an artificial distribution of tasks, we typically end up with overfitting at a meta level: our systems become good at handling specifically crafted challenge problems rather than improving on the underlying capabilities we actually care about6.

A lot of attention has been paid to challenging math problems, for example, primarily because they are easy to verify. There is some debate over whether math performance improvements generalize across the board5, but regardless, building useful AGI / AI systems should fundamentally be about covering the largest portion of societally useful tasks. There are arguments to be made that instruction-following consistency, for example, matters far more than math performance for many AI use cases.

To be clear, eval saturation is a real issue, and there's certainly evidence that improved performance on adversarial evals generalizes to some degree. The solution isn't necessarily to abandon adversarial evals entirely, but I think we have to be very careful about how, and how much, we adversarially filter questions.

1. SimpleQA was specifically designed to include questions that GPT models hallucinated on. HLE uses LLMs as part of the initial question screening process.

2. Or rather, questions existing LLMs can't answer, which isn't necessarily the same thing.

3. At least, if you want to have "useful" evals.

4. That is, a few major breakthroughs yield the vast majority of tangible value and progress. Often, these breakthroughs emerge from the hardest problems.

5. Empirically, they do to some degree.

6. LLM failure modes on Alice in Wonderland style questions exemplify this.

7. At least one major company I know of that trained their own models included Quizlet in their training distribution because they found it improved their score on MMLU.