HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam

📅 2026-02-15
📈 Citations: 0
✨ Influential: 0
๐Ÿ“„ PDF
🤖 AI Summary
This work addresses the noise in the original Humanity's Last Exam (HLE) benchmark, which introduces significant bias into model evaluation and distorts cross-model comparisons. To mitigate this issue, the authors propose a two-stage verification and repair pipeline that integrates a transparent validation protocol, a fine-grained error taxonomy, dual independent expert revision, and model-assisted auditing, ensuring precise corrections while preserving the original intent of each question. The resulting HLE-Verified benchmark comprises 641 verified questions, 1,170 revised and certified items, and 689 uncertain cases. Experimental results show that leading large language models gain an average of 7-10 percentage points in accuracy on the refined benchmark, with improvements of 30-40 percentage points on originally erroneous questions.

๐Ÿ“ Abstract
Humanity's Last Exam (HLE) has become a widely used benchmark for evaluating frontier large language models on challenging, multi-domain questions. However, community-led analyses have raised concerns that HLE contains a non-trivial number of noisy items, which can bias evaluation results and distort cross-model comparisons. To address this challenge, we introduce HLE-Verified, a verified and revised version of HLE with a transparent verification protocol and fine-grained error taxonomy. Our construction follows a two-stage validation-and-repair workflow resulting in a certified benchmark. In Stage I, each item undergoes binary validation of the problem and final answer through domain-expert review and model-based cross-checks, yielding 641 verified items. In Stage II, flawed but fixable items are revised under strict constraints preserving the original evaluation intent, through dual independent expert repairs, model-assisted auditing, and final adjudication, resulting in 1,170 revised-and-certified items. The remaining 689 items are released as a documented uncertain set with explicit uncertainty sources and expertise tags for future refinement. We evaluate seven state-of-the-art language models on HLE and HLE-Verified, observing an average absolute accuracy gain of 7-10 percentage points on HLE-Verified. The improvement is particularly pronounced on items where the original problem statement and/or reference answer is erroneous, with gains of 30-40 percentage points. Our analyses further reveal a strong association between model confidence and the presence of errors in the problem statement or reference answer, supporting the effectiveness of our revisions. Overall, HLE-Verified improves HLE-style evaluations by reducing annotation noise and enabling more faithful measurement of model capabilities. Data is available at: https://github.com/SKYLENAGE-AI/HLE-Verified
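The abstract's two-stage workflow amounts to a triage over benchmark items: Stage I keeps items that pass binary validation of both problem and answer, Stage II repairs fixable failures, and the rest go to a documented uncertain set. The sketch below illustrates only that split logic as described in the abstract; the `Item` fields, function names, and decision flow are assumptions for illustration, not the authors' actual pipeline.

```python
# Hypothetical sketch of HLE-Verified's two-stage triage, reconstructed
# from the abstract alone. Field names and logic are illustrative
# assumptions, not the paper's implementation.
from dataclasses import dataclass

@dataclass
class Item:
    qid: str
    problem_valid: bool       # Stage I: expert review + model cross-check on the problem
    answer_valid: bool        # Stage I: same checks on the reference answer
    repairable: bool = False  # Stage II: flawed but fixable under intent-preserving constraints

def triage(item: Item) -> str:
    """Assign an item to one of the three released HLE-Verified splits."""
    if item.problem_valid and item.answer_valid:
        return "verified"            # Stage I: passes both binary validations
    if item.repairable:
        return "revised_certified"   # Stage II: dual expert repair, audit, adjudication
    return "uncertain"               # released with uncertainty sources and expertise tags

items = [
    Item("q1", True, True),
    Item("q2", False, True, repairable=True),
    Item("q3", False, False),
]
splits = [triage(it) for it in items]
```

Under this reading, the three splits are mutually exclusive and exhaustive, which matches the abstract's item counts (641 + 1,170 + 689) summing over the released benchmark.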
Problem

Research questions and friction points this paper is trying to address.

Humanity's Last Exam
benchmark noise
evaluation bias
noisy items
LLM evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

benchmark verification
error taxonomy
systematic revision
model evaluation
annotation noise reduction
Authors

Weiqi Zhai
Alibaba Group

Zhihai Wang
Qwen Team; PhD, USTC
Sample-Efficient Reinforcement Learning, RL4LLM, Agentic RL

Jinghang Wang
Alibaba Group

Boyu Yang
Alibaba Group

Xiaogang Li
Alibaba Group

Xiang Xu
Alibaba Group

Bohan Wang
Qwen Team, Alibaba Group

Peng Wang
Qwen Team
World Model, Representation Learning

Xingzhe Wu
Qwen Team, Alibaba Group

Anfeng Li
Qwen Team, Alibaba Group

Qiyuan Feng
Alibaba Group

Yuhao Zhou
Alibaba Group

Shoulin Han
Alibaba Group

Wenjie Luo
Nanyang Technological University
AIoT

Yiyuan Li
University of North Carolina at Chapel Hill
Natural Language Processing, Computational Linguistics

Yaxuan Wang
PhD student in Computer Science, University of California, Santa Cruz
Machine Learning

Ruixian Luo
Alibaba Group

Guojie Lin
Alibaba Group

Peiyao Xiao
PhD candidate, University at Buffalo
Multi-objective Optimization, Federated Learning, Bilevel Optimization

Chengliang Xu
Alibaba Group

Ben Wang
University of Oklahoma

Zeyu Wang
Alibaba Group

Zichao Chen
Alibaba Group

Jianan Ye
University of Liverpool
Anomaly Detection, Deep Learning, Computer Vision

Yijie Hu
Alibaba Group