🤖 AI Summary
This work addresses a critical limitation in evaluating large language models (LLMs) on competitive programming: the conflation of algorithmic reasoning with code-level implementation. To disentangle these capabilities, the study introduces an evaluation paradigm centered on natural-language solution explanations, known as editorials, which decouples problem solving from coding. The authors construct a dataset of expert-written editorials and comprehensive test suites for ICPC-style problems. Using an LLM-as-a-judge protocol, they show that even when given expert-level gold editorials, current models still struggle with implementation. Moreover, the quality of model-generated editorials strongly correlates with solution success, pointing to a fundamental bottleneck in LLMs' ability to formulate correct and complete algorithmic strategies.
📝 Abstract
Large Language Models (LLMs) increasingly succeed on competitive programming problems, yet existing evaluations conflate algorithmic reasoning with code-level implementation. We argue that competitive programming is fundamentally a problem-solving task and propose centering natural-language editorials in both solution generation and evaluation. Generating an editorial before writing code improves solve rates for some LLMs, with substantially larger gains when using expertly written gold editorials. However, even with gold editorials, models continue to struggle with implementation, while the gap between generated and gold editorials reveals a persistent problem-solving bottleneck in specifying correct and complete algorithms. Beyond pass/fail metrics, we diagnose reasoning errors by comparing model-generated editorials to gold standards using expert annotations, and we validate an LLM-as-a-judge protocol for scalable evaluation. We introduce a dataset of 83 ICPC-style problems with gold editorials and full test suites, evaluate 19 LLMs, and argue that future benchmarks should explicitly separate problem solving from implementation.