ProBench: Benchmarking Large Language Models in Competitive Programming

📅 2025-02-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing code-generation benchmarks inadequately assess large language models’ (LLMs) high-level code reasoning capabilities in competitive programming. Method: We introduce ProBench—the first LLM evaluation benchmark specifically designed for algorithmic contests—curated from real Codeforces problems (July–December 2024). It features difficulty grading, fine-grained algorithmic tagging, and multidimensional capability assessment. Our novel evaluation framework integrates chain-of-thought analysis, error attribution, and hierarchical reasoning-depth quantification, augmented by a fairness-preserving mechanism grounded in real online submission verification. Automated problem acquisition, unified attribute modeling, and quantitative scoring enable systematic evaluation. Contribution/Results: We benchmark nine state-of-the-art LLMs; QwQ-32B-Preview achieves the highest score (20.93). Results identify algorithmic adaptability and reasoning sufficiency as critical bottlenecks, underscoring ProBench’s utility in diagnosing advanced coding reasoning deficits.

📝 Abstract
With reasoning language models such as OpenAI-o3 and DeepSeek-R1 emerging, large language models (LLMs) have entered a new phase of development. However, existing benchmarks for coding evaluation are increasingly inadequate for assessing the code-reasoning capability of advanced LLMs. To bridge this gap in high-level code reasoning assessment, we propose ProBench, a benchmark for LLMs in competitive programming inspired by the International Collegiate Programming Contest. ProBench collects a comprehensive set of competitive programming problems from the Codeforces, Luogu, and Nowcoder platforms from July to December 2024, obtaining real test results through online submissions to ensure fair and accurate evaluation. We establish a unified problem attribute system, including difficulty grading and algorithm tagging. With the carefully collected and annotated data in ProBench, we systematically assess nine recent LLMs on competitive programming across multiple dimensions, including thought-chain analysis, error-type diagnosis, and reasoning-depth evaluation. Experimental results show that QwQ-32B-Preview achieves the best score of 20.93, followed by DeepSeek-V3 with 16.38, suggesting that models trained on specialized reasoning tasks significantly outperform general-purpose models (even those larger than the reasoning-oriented models) in programming. Further analysis reveals key areas for enhancing programming capability, e.g., algorithm adaptability and reasoning sufficiency, providing important insights for the future development of reasoning models.
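The "unified problem attribute system" described above (difficulty grading plus fine-grained algorithm tagging) can be pictured as a small structured record per problem. The sketch below is an illustrative assumption, not ProBench's actual schema: the field names (`problem_id`, `platform`, `difficulty`, `tags`) and sample values are invented for demonstration.

```python
from dataclasses import dataclass, field

# Hypothetical record for a contest problem with graded difficulty and
# algorithm tags, in the spirit of the attribute system the abstract describes.
@dataclass
class ContestProblem:
    problem_id: str                 # e.g. a Codeforces-style identifier (made up here)
    platform: str                   # "Codeforces", "Luogu", or "Nowcoder"
    difficulty: int                 # graded difficulty (e.g. a contest rating)
    tags: list = field(default_factory=list)  # fine-grained algorithm tags

problems = [
    ContestProblem("CF-A", "Codeforces", 800, ["greedy"]),
    ContestProblem("CF-C", "Codeforces", 1600, ["dp", "graphs"]),
]

# Group problems by tag to support per-algorithm capability analysis.
by_tag = {}
for p in problems:
    for tag in p.tags:
        by_tag.setdefault(tag, []).append(p.problem_id)

print(by_tag)
# {'greedy': ['CF-A'], 'dp': ['CF-C'], 'graphs': ['CF-C']}
```

Grouping by tag in this way is what makes per-algorithm breakdowns (e.g. "how does a model do on dp versus greedy?") possible once submission results are attached to each problem.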
Problem

Research questions and friction points this paper is trying to address.

Assess advanced LLMs in high-level code reasoning
Benchmark LLMs using competitive programming problems
Identify key areas for programming capability enhancement
Innovation

Methods, ideas, or system contributions that make the work stand out.

ProBench benchmarks LLMs in competitive programming.
Uses real online-submission results from Codeforces, Luogu, and Nowcoder.
Assesses LLMs via thought-chain analysis, error-type diagnosis, and reasoning-depth evaluation.
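The quantitative scoring behind results like "QwQ-32B-Preview achieves the best score of 20.93" could take many forms; one plausible shape is a difficulty-weighted solve rate over verified online submissions. The function below is a minimal sketch under that assumption and is not ProBench's published scoring formula.

```python
def weighted_score(results):
    """Difficulty-weighted solve rate in [0, 100].

    results: list of (difficulty, solved) pairs, where `solved` reflects
    a verified online-submission verdict. Purely illustrative.
    """
    total = sum(difficulty for difficulty, _ in results)
    earned = sum(difficulty for difficulty, solved in results if solved)
    return 100.0 * earned / total if total else 0.0

# Example: a model solves the two easier problems but fails the harder ones.
results = [(800, True), (1200, True), (1600, False), (2000, False)]
print(round(weighted_score(results), 2))  # 35.71
```

Weighting by difficulty rewards solving hard problems more than easy ones, which matches the benchmark's emphasis on high-level reasoning rather than raw pass counts.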