We Can't Always Measure Superhuman Capability
(And That's A Problem.)
We might want to know when AI systems are superhuman at some task - which, for this post, will mean far better than the top humans. But given the way we measure the capabilities of these systems, for many tasks we might not be able to tell when it happens.
To start, we know how to measure AI system capabilities in domains where humans currently dominate. For medical diagnostics, for example, we have curated datasets that human experts have judged, and we check whether the AI system does as well as they do. We do the same for graduate-school level science knowledge. But this doesn’t scale.
Imagine it’s 1970, and we are trying to figure out how to evaluate new chess-playing systems. Mac Hack VI can already play decently and wins some tournaments, but it can’t compete with grandmasters, and we want to measure progress. We can get human grandmasters to grade moves, and check whether chess engines find the best move, or at least avoid blunders. This method would have worked for decades - it would give us a very good signal, and the rankings would basically match the capabilities of the systems. But fast-forward to the late 1990s, and suddenly it would stop working; the moves grandmasters prefer aren’t always the ones that win games. And a decade later, all of the chess engines are better than the best humans, and grandmaster grading would be meaningless.
Obviously, this isn’t how we measure chess capability. Just playing against the systems, a grandmaster couldn’t necessarily tell whether a chess-playing AI was just a bit better or strongly superhuman. Instead, AI systems play games against one another and we rank them on the results. That’s how we know chess engines didn’t actually get wildly better than the best human grandmasters for around a decade after Deep Blue. They are certainly there now - top AI chess systems completely dominate. But this only works in competitive domains, where competition itself provides a baseline beyond the best humans.
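To make the ranking idea concrete, here is a minimal sketch of Elo-style pairwise rating, the kind of head-to-head comparison chess engine rankings rely on. The K-factor, starting ratings, and win pattern below are illustrative assumptions, not any real rating list’s data.

```python
# Minimal Elo-style pairwise rating sketch (illustrative numbers only).
# Each game updates both players' ratings based on how surprising the result was.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that player A beats player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, score_a: float, k: float = 16.0):
    """Return both players' new ratings after one game; score_a is 1, 0.5, or 0 for A."""
    e_a = expected_score(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b + k * ((1 - score_a) - (1 - e_a))

# Two engines that keep beating a 2800-rated human drift upward together,
# but only the games *between* them reveal which one is strongly superhuman.
engine_a, engine_b, human = 2800.0, 2800.0, 2800.0
for _ in range(200):
    engine_a, human = update(engine_a, human, 1.0)        # A always beats the human
    engine_b, human = update(engine_b, human, 1.0)        # so does B
    engine_a, engine_b = update(engine_a, engine_b, 1.0)  # but A also always beats B

print(round(engine_a), round(engine_b), round(human))
```

The point is that the human’s results stop carrying information once both engines win every game against them; only the engine-versus-engine games separate the two.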
In non-competitive domains, we often train narrow AI against human-labeled gold standards, and systems can get better than most humans that way. We trained general AI systems on math, and on competition problems they outperform most top humans. Even outside of competitive math, they are getting close to top human levels. Of course, math can be checked, so we’ll probably know when systems become superhuman there, such as when they can rigorously prove things humans cannot. But cross-domain generalization and stronger capabilities in adjacent domains mean that we might see superhuman capabilities in domains that aren’t easy to check.
Coding systems aren’t yet human-level, much less superhuman. They sometimes make stupid mistakes that careful human programmers don’t. But for contest math, contest programming, and other evaluable domains, we’re seeing LLMs get perfect scores on the hardest tests humans take. Can we tell the difference between human and strongly superhuman performance? Not really - in many cases, both humans and AI score perfectly. Sometimes the AI does slightly better, but as in the grandmaster-scored chess example, how would we know how much better?
Every time an AI system gets an answer obviously wrong, we say it’s not yet human-level. And there are certainly domains where this is true - it generates objectively wrong labels on images, it recommends clearly incorrect plans for software development, it gets lost on tasks that take humans a long time to complete. But when a chess-playing bot recommends something a top grandmaster thinks is wrong, we don’t say the AI is wrong. More generally, where top human performance is far short of perfection and we don’t have an objective measure, we can’t be sure the rubric we score AI systems against is even correct on superhuman problems.
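A toy way to see the rubric problem: if the “gold standard” answers are themselves produced by imperfect human graders, a model that is actually better than the graders can score lower than one that simply imitates their mistakes. A minimal simulation sketch, with made-up error rates and binary labels assumed purely for illustration:

```python
# Toy simulation: scoring against human-labeled "gold" answers caps measurable skill.
# All error rates below are made up for illustration.
import random

random.seed(0)
N = 100_000
truth = [random.random() < 0.5 for _ in range(N)]  # unobservable ground truth

def noisy_copy(labels, error_rate):
    """Flip each label with the given probability."""
    return [(not y) if random.random() < error_rate else y for y in labels]

human_rubric = noisy_copy(truth, 0.10)          # human graders are wrong 10% of the time
human_like_ai = noisy_copy(human_rubric, 0.02)  # mimics the humans, mistakes included
superhuman_ai = noisy_copy(truth, 0.01)         # nearly always right about ground truth

def agreement(a, b):
    return sum(x == y for x, y in zip(a, b)) / len(a)

print("vs human rubric:  human-like AI", agreement(human_like_ai, human_rubric),
      " superhuman AI", agreement(superhuman_ai, human_rubric))
print("vs ground truth:  human-like AI", agreement(human_like_ai, truth),
      " superhuman AI", agreement(superhuman_ai, truth))
```

Against the human rubric, the genuinely better model looks worse; only access to ground truth - which, by assumption, we lack in these domains - reveals the inversion.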
We are getting close to superintelligence in more and more areas - but we probably won’t know when it happens. That is, once an AI system gets close enough to the top human level, we often lose the ability to tell whether it’s better than we are. That doesn’t mean we can’t use current techniques to make new generations of models better - as is already happening. But it means we won’t have a way to measure their capabilities beyond the human level.
Cars driven by AI are superhumanly safe by now - they get into fewer accidents. It’s good that we have a measurable outcome there. But we probably can’t tell the difference between self-driving cars that anticipate and avoid the accidents human drivers cause and those that don’t. And we limit them to the speed limit, so we don’t know whether they would remain safer than humans if they were allowed to drive far faster than we do. They could end up strongly superhuman rather than just slightly safer, and we probably wouldn’t know it.
As another example, what happens when AI is better at diagnosing medical conditions than humans? What do we do when an AI system can spot a cancer on a scan that no human can see? Do we even notice if those are the patients who develop human-diagnosable cancer a year later? Scored against human-labeled answers, the model starts looking “worse” than humans, and we don’t realize that this is because it’s actually better than we are. Of course, there’s an eventual human-discoverable outcome - which means that if someone bothers checking, which seems unlikely, we’ll eventually know whether the AI figured it out better than we did.
But what do we do when an AI system goes beyond human level in treating conditions? When it prescribes medicines humans would not, because it knows better than the doctors? When it recommends treatments that are contraindicated by current guidelines but would work better and more safely than what humans recommend? This goes beyond us not noticing superhuman performance - we would reject the “bad” suggestions, and would not be able to tell at all.
Again, when systems get superhuman in most domains, we won’t know it - because the way we evaluate systems is almost always anchored to human judgment. And when superhuman capabilities are dangerous, that should worry us.

