Humanity's Last Code Exam: Can Advanced LLMs Conquer Human's Hardest Code Competition?

📅 2025-06-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing code generation benchmarks (e.g., APPS, LiveCodeBench) exhibit insufficient difficulty to rigorously evaluate advanced large language models (LLMs) on complex programming and high-level reasoning. Method: We introduce HLCE—a novel, high-difficulty benchmark comprising 235 ICPC/IOI World Finals–level problems (2010–2024)—accompanied by a unified online-offline sandbox for reproducible evaluation. We further propose the "self-recognition" task, wherein models assess their own solution correctness. Contribution/Results: Empirical analysis reveals that model self-assessment accuracy is consistently below 50% and weakly correlated with actual performance. State-of-the-art models—o4-mini (high) and Gemini-2.5 Pro—achieve only 15.9% and 11.4% pass@1, respectively, underscoring substantial room for improvement in complex program synthesis. HLCE thus provides a more rigorous, realistic, and standardized evaluation framework for next-generation LLMs.

📝 Abstract
Code generation is a core capability of large language models (LLMs), yet mainstream benchmarks (e.g., APPS and LiveCodeBench) contain questions of medium difficulty that pose no challenge to advanced LLMs. To better reflect advanced reasoning and code generation ability, we introduce Humanity's Last Code Exam (HLCE), comprising the 235 most challenging problems from the International Collegiate Programming Contest (ICPC) World Finals and the International Olympiad in Informatics (IOI) spanning 2010–2024. As part of HLCE, we design a harmonized online-offline sandbox that guarantees fully reproducible evaluation. Through our comprehensive evaluation, we observe that even the strongest reasoning LLMs, o4-mini (high) and Gemini-2.5 Pro, achieve pass@1 rates of only 15.9% and 11.4%, respectively. Meanwhile, we propose a novel "self-recognition" task to measure LLMs' awareness of their own capabilities. Results indicate that LLMs' self-recognition abilities are not proportionally correlated with their code generation performance. Finally, our empirical validation of test-time scaling laws reveals that current advanced LLMs have substantial room for improvement on complex programming tasks. We expect HLCE to become a milestone challenge for code generation and to catalyze advances in high-performance reasoning and human-AI collaborative programming. Our code and dataset are publicly available (https://github.com/Humanity-s-Last-Code-Exam/HLCE).
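The pass@1 figures quoted above are conventionally computed with the unbiased pass@k estimator introduced alongside HumanEval (Chen et al., 2021); the listing does not specify HLCE's exact estimator, so the sketch below is illustrative rather than a reproduction of the paper's evaluation code.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    samples, drawn without replacement from n generations of which c are
    correct, passes all tests."""
    if n - c < k:
        # Fewer incorrect samples than k: every draw of k must include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: with 100 samples per problem and 16 correct,
# pass@1 is simply the fraction of correct samples.
print(pass_at_k(100, 16, 1))   # 0.16
print(pass_at_k(100, 16, 10))  # higher: any of 10 draws may succeed
```

For k = 1 the estimator reduces to the raw success rate c / n, which is why pass@1 can be read directly as "fraction of single attempts that solve the problem."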
Problem

Research questions and friction points this paper is trying to address.

Evaluating advanced LLMs on the hardest programming-competition problems
Assessing LLMs' awareness of their own code generation capabilities
Exploring test-time scaling laws for LLMs on complex programming tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces HLCE, a benchmark of 235 ICPC World Finals and IOI problems (2010–2024)
Designs a harmonized online-offline sandbox for reproducible evaluation
Proposes a "self-recognition" task probing LLMs' awareness of their own capabilities
Authors

Xiangyang Li (Huawei Noah's Ark Lab)
Xiaopeng Li
Kuicai Dong (Huawei Noah's Ark Lab, Nanyang Technological University) — Natural Language Processing, Information Extraction, Information Retrieval, RAG, Recommendation
Quanhu Zhang (Huawei Noah's Ark Lab)
Rongju Ruan (Huawei Noah's Ark Lab)
Xinyi Dai (Noah's Ark Lab, Huawei) — Information Retrieval, Recommender System, Large Language Models
Xiaoshuang Liu (Huawei Noah's Ark Lab)
Shengchun Xu (Huawei Noah's Ark Lab)
Yasheng Wang (Tencent) — Natural Language Processing
Ruiming Tang (Huawei Noah's Ark Lab)