As artificial intelligence systems began scoring extremely high on long-used academic benchmarks, researchers noticed a growing problem. The tests that once challenged machines were no longer difficult enough. Well-known evaluations such as the Massive Multitask Language Understanding (MMLU) exam, which had previously been seen as demanding, now fail to properly measure the capabilities of today’s advanced AI models.
To address this problem, a global team of nearly 1,000 researchers, including a professor from Texas A&M University, developed a new kind of test. Their goal was to build an exam that is broad, difficult, and grounded in expert human knowledge in ways that current AI systems still struggle to handle.
The result is “Humanity’s Last Exam” (HLE), a 2,500-question assessment covering mathematics, humanities, natural sciences, ancient languages, and a range of highly specialized academic fields. Details of the project appear in a paper published in Nature, and more information about the exam is available at lastexam.ai.
Among the many contributors is Dr. Tung Nguyen, instructional associate professor in the Department of Computer Science and Engineering at Texas A&M. Nguyen helped write and refine many of the exam questions.
“When AI systems start performing extremely well on human benchmarks, it’s tempting to think they’re approaching human-level understanding,” Nguyen said. “But HLE reminds us that intelligence isn’t just about pattern recognition; it’s about depth, context and specialized expertise.”
The goal of the exam was not to trick or defeat human test takers. Instead, the aim was to carefully identify areas where AI systems still fall short.
A Global Effort to Measure AI’s Limits
Experts from around the world wrote and reviewed the questions included in Humanity’s Last Exam. Each problem was carefully designed to have one clear, verifiable answer. The questions were also crafted to prevent quick solutions through simple internet searches.
The topics come from advanced academic challenges. Some tasks involve translating ancient Palmyrene inscriptions, while others require identifying tiny anatomical structures in birds or analyzing detailed features of Biblical Hebrew pronunciation.
Researchers tested every question against leading AI systems. If any model was able to answer a question correctly, that question was removed from the final exam. This process ensured the test remained just beyond what current AI systems can reliably solve.
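In outline, that filtering step works like a gate: a candidate question survives only if every frontier model fails it. The sketch below illustrates the idea under stated assumptions; `query_model` and `grade_answer` are hypothetical stand-ins, not the HLE team’s actual tooling, and the model names are placeholders.

```python
# Minimal sketch of the adversarial filtering described above:
# a candidate question is kept only if no frontier model answers it correctly.
# `query_model` and `grade_answer` are hypothetical stand-ins for whatever
# API clients and graders the project actually used.

from typing import Callable

FRONTIER_MODELS = ["model-a", "model-b", "model-c"]  # placeholder names


def survives_filtering(
    question: str,
    reference_answer: str,
    query_model: Callable[[str, str], str],
    grade_answer: Callable[[str, str], bool],
) -> bool:
    """Return True only if every frontier model fails the question."""
    for model_name in FRONTIER_MODELS:
        prediction = query_model(model_name, question)
        if grade_answer(prediction, reference_answer):
            return False  # one correct answer disqualifies the question
    return True


def filter_candidates(candidates, query_model, grade_answer):
    """Keep only (question, answer) pairs that all models fail."""
    return [
        (q, a)
        for q, a in candidates
        if survives_filtering(q, a, query_model, grade_answer)
    ]
```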
Early testing showed that the strategy worked. Even powerful AI models struggled with the exam. GPT-4o achieved a score of 2.7%, while Claude 3.5 Sonnet reached 4.1%. OpenAI’s o1 model performed significantly better at 8%. The most capable systems to date, including Gemini 3.1 Pro and Claude Opus 4.6, have reached accuracy levels between about 40% and 50%.
Why New AI Benchmarks Are Needed
Nguyen explained that the issue of AI surpassing older tests is more than a technical concern. He contributed 73 of the 2,500 publicly available questions in HLE, the second-highest number among contributors, and wrote the most questions related to mathematics and computer science.
“Without proper evaluation tools, policymakers, developers and users risk misinterpreting what AI systems can actually do,” he said. “Benchmarks provide the foundation for measuring progress and identifying risks.”
According to the research team, high scores on tests originally designed for humans do not necessarily indicate genuine intelligence. These benchmarks primarily measure how well AI can complete specific tasks created for human learners, rather than capturing deeper understanding.
Not a Threat, but a Tool
Despite the dramatic name, Humanity’s Last Exam is not meant to suggest that humans are becoming obsolete. Instead, it highlights the vast amount of knowledge and expertise that still remains uniquely human.
“This isn’t a race against AI,” Nguyen said. “It’s a method for understanding where these systems are strong and where they struggle. That understanding helps us build safer, more reliable technologies. And, importantly, it reminds us why human expertise still matters.”
Building a Long-Term AI Benchmark
Humanity’s Last Exam is designed to serve as a durable and transparent benchmark for future AI systems. To support that goal, the researchers have released some questions publicly while keeping the majority hidden so that AI models cannot simply memorize the answers.
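For readers who want to try a model against the released portion, scoring reduces to measuring exact-match accuracy over the public questions; the hidden set can only be graded by whoever holds it. The sketch below is illustrative only, and the JSONL path and field names (“question”, “answer”) are assumptions; consult lastexam.ai for the authoritative release format.

```python
# Sketch: scoring a model on the publicly released questions.
# File path and record fields are assumed for illustration, not taken
# from the actual HLE release.

import json


def accuracy_on_public_split(path: str, answer_fn) -> float:
    """Fraction of public questions the model answers exactly right."""
    total = correct = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            prediction = answer_fn(record["question"])
            correct += prediction.strip() == record["answer"].strip()
            total += 1
    return correct / total if total else 0.0
```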
“For now, Humanity’s Last Exam stands as one of the clearest assessments of the gap between AI and human intelligence,” Nguyen said, “and despite rapid technological advances, it remains vast.”
A Massive International Research Effort
Nguyen emphasized that the scale of the project demonstrates the value of collaboration across disciplines and countries.
“What made this project extraordinary was the scale,” he said. “Experts from nearly every discipline contributed. It wasn’t just computer scientists; it was historians, physicists, linguists, medical researchers. That diversity is exactly what exposes the gaps in today’s AI systems. Perhaps ironically, it is humans working together.”
