AetherCode: Evaluating LLMs' Ability to Win In Premier Programming Competitions

📅 2025-08-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing programming contest benchmarks suffer from insufficient problem difficulty, inadequate coverage, and low-quality test cases, leading to severe evaluation bias when assessing large language models (LLMs). Method: We introduce AetherCode, the first high-fidelity benchmark tailored to elite competitions (e.g., IOI, ICPC), comprising 1,200+ manually curated and annotated challenging problems. We further propose a human-in-the-loop test case generation framework that integrates automated construction with expert validation to ensure functional completeness and boundary-case robustness. Contribution/Results: AetherCode substantially widens the measured performance gap between LLMs and top human competitors, with mainstream models achieving sub-15% average pass@1 accuracy. It systematically exposes fundamental limitations of LLMs in complex algorithmic reasoning, multi-step logical composition, and error recovery. By establishing a more rigorous and trustworthy evaluation standard, AetherCode advances research in code generation and reasoning.
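The pass@1 figure above is conventionally computed with the unbiased pass@k estimator popularized by the Codex paper (pass@k = 1 − C(n−c, k)/C(n, k)). A minimal sketch of that standard formula, not code taken from AetherCode itself:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples, drawn without replacement from n generations of which c
    pass all tests, is correct.

    pass@k = 1 - C(n - c, k) / C(n, k)
    """
    if n - c < k:
        # Fewer failing samples than k: some correct sample is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10 samples per problem and exactly 1 correct, pass@1 is 0.1.
print(round(pass_at_k(10, 1, 1), 6))
```

For k = 1 this reduces to the fraction of correct samples, c/n, averaged over problems.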

📝 Abstract
Competitive programming has emerged as a critical benchmark for evaluating the reasoning and coding capabilities of Large Language Models (LLMs). Despite impressive progress on existing benchmarks, we argue that current evaluations overstate model proficiency, masking a substantial gap between LLMs and elite human programmers. This gap arises from two key limitations: insufficient difficulty and scope of benchmark problems, and evaluation bias from low-quality test cases. To address these shortcomings, we present AetherCode, a new benchmark that draws problems from premier programming competitions such as IOI and ICPC, offering broader coverage and higher difficulty. AetherCode further incorporates comprehensive, expert-validated test suites built through a hybrid of automated generation and human curation, ensuring rigorous and reliable assessment. By combining challenging problem design with robust evaluation, AetherCode provides a more faithful measure of LLM capabilities and sets a new standard for future research in code reasoning.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' coding abilities in premier programming competitions
Addressing overstatement of model proficiency versus elite programmers
Providing rigorous assessment with expert-validated comprehensive test suites
Innovation

Methods, ideas, or system contributions that make the work stand out.

Premier programming competition problems benchmark
Hybrid automated and human expert test suites
Rigorous reliable assessment for LLM capabilities
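The paper does not release its test-generation pipeline, but the hybrid idea above can be illustrated with a differential-testing loop: an automated generator produces inputs (including boundary sizes), a trusted reference solution adjudicates, and disagreements are queued for human review. All names below are hypothetical, purely illustrative:

```python
import random
from typing import Callable

def generate_case(rng: random.Random) -> list[int]:
    """Hypothetical generator: random arrays, biased toward tiny boundary sizes."""
    n = rng.choice([1, 2, rng.randint(3, 50)])
    return [rng.randint(-100, 100) for _ in range(n)]

def build_suite(reference: Callable[[list[int]], int],
                candidate: Callable[[list[int]], int],
                trials: int = 200, seed: int = 0) -> list[list[int]]:
    """Differential testing: collect inputs where the candidate disagrees
    with the trusted reference; these would go to expert review."""
    rng = random.Random(seed)
    flagged = []
    for _ in range(trials):
        case = generate_case(rng)
        if candidate(case) != reference(case):
            flagged.append(case)
    return flagged

# Toy example: reference "maximum element" vs. a buggy candidate that
# silently assumes at least one non-negative element exists.
def ref_max(xs: list[int]) -> int:
    return max(xs)

def buggy_max(xs: list[int]) -> int:
    return max([x for x in xs if x > 0], default=0)

flagged = build_suite(ref_max, buggy_max)
print(len(flagged) > 0)  # all-negative inputs expose the bug
```

The boundary bias in the generator matters: uniformly large random inputs rarely hit degenerate cases (empty ranges, single elements, all-negative arrays), which is exactly the low-quality-test-case failure mode the benchmark criticizes.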
Authors
Zihan Wang (ByteDance, M-A-P)
Jiaze Chen (ByteDance)
Zhicheng Liu (ByteDance, M-A-P)
Markus Mak (ByteDance, M-A-P)
Yidi Du (ByteDance, M-A-P)
Geonsik Moon (ByteDance, M-A-P)
Luoqi Xu (ByteDance, M-A-P)
Aaron Tua (ByteDance, M-A-P)
Kunshuo Peng (ByteDance, M-A-P)
Jiayi Lu (Beihang University)
Mingfei Xia (ByteDance, M-A-P)
Boqian Zou (ByteDance, M-A-P)
Chenyang Ran (ByteDance, M-A-P)
Guang Tian (ByteDance, M-A-P)
Shoutai Zhu (ByteDance, M-A-P)
Yeheng Duan (ByteDance, M-A-P)
Zhenghui Kang (ByteDance, M-A-P)
Zhenxing Lin (ByteDance, M-A-P)
Shangshu Li (ByteDance, M-A-P)
Qiang Luo (ISTBI, Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University)
Qingshen Long (ByteDance, M-A-P)
Zhiyong Chen (Shanghai Jiao Tong University)
Yihan Xiao (Meta Platforms, Inc.; University of California, Berkeley)
Yurong Wu (ByteDance, M-A-P)
Daoguang Zan (ByteDance Seed)