🤖 AI Summary
This work addresses a critical limitation in evaluating large language models (LLMs) on competitive programming: the conflation of algorithmic reasoning with code-level implementation. To disentangle these capabilities, the study introduces an evaluation paradigm centered on natural-language solution explanations, known as editorials, which decouples problem solving from coding. The authors construct a dataset of expert-written editorials and comprehensive test suites for ICPC-style problems. Using an LLM-as-a-judge protocol, they show that even when given expert-level gold editorials, current models still struggle with implementation. Moreover, the quality of model-generated editorials strongly correlates with solution success, pointing to a fundamental bottleneck in LLMs' ability to formulate correct and complete algorithmic strategies.
📝 Abstract
Large Language Models (LLMs) increasingly succeed on competitive programming problems, yet existing evaluations conflate algorithmic reasoning with code-level implementation. We argue that competitive programming is fundamentally a problem-solving task and propose centering natural-language editorials in both solution generation and evaluation. Generating an editorial before writing code improves solve rates for some LLMs, with substantially larger gains when using expertly written gold editorials. However, even with gold editorials, models continue to struggle with implementation, while the gap between generated and gold editorials reveals a persistent problem-solving bottleneck in specifying correct and complete algorithms. Beyond pass/fail metrics, we diagnose reasoning errors by comparing model-generated editorials to gold standards using expert annotations, and we validate an LLM-as-a-judge protocol for scalable evaluation. We introduce a dataset of 83 ICPC-style problems with gold editorials and full test suites, evaluate 19 LLMs, and argue that future benchmarks should explicitly separate problem solving from implementation.