Evaluating and Improving Large Language Models for Competitive Program Generation

📅 2025-06-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) show weak algorithmic reasoning and poor logical implementation, struggle with strict formatting requirements, and have been insufficiently evaluated on real-world competitive programming tasks. Method: We construct a leakage-free benchmark of 80 authentic ICPC/CCPC problems and propose a fine-grained error taxonomy covering both general and domain-specific errors. We further design a generation optimization framework leveraging multi-turn dialogue-based repair and context-aware information enhancement. Contribution/Results: Using DeepSeek-R1 as the base model, our approach improves the pass rate from 5/80 under basic prompting to 46/80 (+41), significantly enhancing accuracy, robustness, and formatting compliance on high-difficulty algorithmic problems. This work provides a reproducible methodology and empirical benchmark for evaluating and advancing LLMs' capabilities in rigorous programming tasks.

📝 Abstract
Context: Due to the demand for strong algorithmic reasoning, complex logic implementation, and strict adherence to input/output formats and resource constraints, competitive programming generation by large language models (LLMs) is considered the most challenging problem in current LLM-based code generation. However, previous studies often evaluate LLMs using simple prompts and benchmark datasets prone to data leakage. Moreover, prior work has given limited consideration to the diversity of algorithm types and difficulty levels. Objective: In this study, we aim to evaluate and improve LLMs in solving real-world competitive programming problems. Methods: We initially collect 117 problems from nine regional ICPC/CCPC contests held in 2024 and design four filtering criteria to construct a curated benchmark consisting of 80 problems. Leveraging DeepSeek-R1 as the LLM, we evaluate its competitive program generation capabilities through online judge (OJ) platforms, guided by a carefully designed basic prompt. For incorrect submissions, we construct a fine-grained error taxonomy and then propose a targeted improvement framework combining a multi-turn dialogue-based repair phase and an information-augmented regeneration phase. Results: Experimental results show that only 5 out of 80 problems are fully accepted when using basic prompts. For the unsolved problems, we construct the error taxonomy, including general errors (such as design, boundary, condition, data type, syntax, and input/output errors) and specialized errors (such as those in mathematical problems, greedy algorithms, and graph theory). After applying our proposed improvement strategies, the number of correct solutions increases substantially, with 46 out of 80 problems successfully accepted.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs for competitive programming generation challenges
Addressing data leakage and diversity in algorithm types
Improving LLM performance via error taxonomy and targeted strategies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Constructed a curated benchmark with 80 ICPC/CCPC problems
Used DeepSeek-R1 LLM with OJ evaluation
Proposed multi-turn dialogue and information-augmented regeneration
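The multi-turn dialogue-based repair phase described above can be sketched as a simple feedback loop: generate a solution, submit it for judging, and feed the verdict back into the next generation turn until acceptance or a retry budget runs out. The sketch below is a minimal illustration under stated assumptions; `generate_solution` and `judge` are hypothetical stand-ins (the paper uses DeepSeek-R1 and real OJ platforms), not the authors' implementation.

```python
import io
import contextlib
import builtins

def generate_solution(problem, feedback=None):
    # Stand-in for an LLM call (e.g. DeepSeek-R1). This stub "repairs"
    # its solution once feedback about the failure is provided.
    if feedback is None:
        return "print(input())"          # first attempt: echoes input verbatim
    return "print(int(input()) * 2)"     # repaired attempt after feedback

def judge(solution_code, test_input, expected_output):
    # Stand-in for an OJ submission: run the code on one test case and
    # return a verdict plus diagnostic feedback for the repair dialogue.
    buf = io.StringIO()
    original_input = builtins.input
    builtins.input = lambda: test_input
    try:
        with contextlib.redirect_stdout(buf):
            exec(solution_code, {})
    finally:
        builtins.input = original_input
    actual = buf.getvalue().strip()
    if actual == expected_output:
        return "Accepted", None
    return "Wrong Answer", f"expected {expected_output!r}, got {actual!r}"

def repair_loop(problem, test_input, expected_output, max_turns=3):
    # Multi-turn loop: each failed verdict becomes context for the next turn.
    feedback = None
    verdict = "No Attempt"
    for turn in range(1, max_turns + 1):
        code = generate_solution(problem, feedback)
        verdict, feedback = judge(code, test_input, expected_output)
        if verdict == "Accepted":
            return verdict, turn
    return verdict, max_turns

verdict, turns = repair_loop("double the number", "21", "42")
print(verdict, turns)  # the stub fails once, then is repaired on turn 2
```

In the paper's full framework, problems that the dialogue loop cannot repair move on to an information-augmented regeneration phase, where extra context (e.g. error-category hints) is added before regenerating from scratch.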
Minnan Wei
School of Artificial Intelligence and Computer Science, Nantong University, Nantong, China
Ziming Li
School of Artificial Intelligence and Computer Science, Nantong University, Nantong, China
Xiang Chen
School of Artificial Intelligence and Computer Science, Nantong University, Nantong, China
Menglin Zheng
School of Artificial Intelligence and Computer Science, Nantong University, Nantong, China
Ziyan Qu
School of Artificial Intelligence and Computer Science, Nantong University, Nantong, China
Cheng Yu
PhD student, CSE, Ohio State University
audiovisual speech enhancement, speech separation, online systems, deep learning
Siyu Chen
School of Artificial Intelligence and Computer Science, Nantong University, Nantong, China
Xiaolin Ju
Associate Professor, Nantong University
software engineering, software analysis and testing, program debugging