OJBench: A Competition Level Code Benchmark For Large Language Models

📅 2025-06-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing code benchmarks inadequately assess large language models' (LLMs) multi-step algorithmic reasoning on programming-contest-level tasks. Method: We introduce OJBench—the first evaluation benchmark grounded in authentic National Olympiad in Informatics (NOI) and International Collegiate Programming Contest (ICPC) problems—comprising 232 high-difficulty, tightly constrained problems drawn from real contests. We formally define and implement a competition-grade code reasoning evaluation framework emphasizing algorithmic thinking, robustness, and decomposition of complex logic; ensure problem quality via expert curation and standardized reconstruction; and establish a unified evaluation infrastructure covering 37 models across paradigms (open/closed-source, reasoning/non-reasoning). Contribution/Results: Experiments reveal that state-of-the-art reasoning models—including o4-mini and Gemini-2.5-pro-exp—achieve sub-20% average pass rates on OJBench, exposing fundamental limitations in advanced code reasoning.

📝 Abstract
Recent advancements in large language models (LLMs) have demonstrated significant progress in math and code reasoning capabilities. However, existing code benchmarks are limited in their ability to evaluate the full spectrum of these capabilities, particularly at the competitive level. To bridge this gap, we introduce OJBench, a novel and challenging benchmark designed to assess the competitive-level code reasoning abilities of LLMs. OJBench comprises 232 programming competition problems from NOI and ICPC, providing a more rigorous test of models' reasoning skills. We conducted a comprehensive evaluation using OJBench on 37 models, including both closed-source and open-source models, and both reasoning-oriented and non-reasoning-oriented models. Our results indicate that even state-of-the-art reasoning-oriented models, such as o4-mini and Gemini-2.5-pro-exp, struggle with highly challenging competition-level problems, highlighting the significant challenges models face in competitive-level code reasoning.
Problem

Research questions and friction points this paper is trying to address.

Assessing competitive-level code reasoning in LLMs
Evaluating full spectrum of code reasoning capabilities
Identifying challenges in solving competition-level problems
Innovation

Methods, ideas, or system contributions that make the work stand out.

OJBench assesses competitive-level code reasoning
Includes 232 NOI and ICPC problems
Evaluated 37 models, revealing performance gaps
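The pass-rate metric reported above can be illustrated with a minimal sketch. Everything below is hypothetical (the data structures, names, and sample results are not from the paper); it only shows the standard competitive-judging convention that a problem counts as solved when a submission passes all hidden test cases, with the model's score being the fraction of problems fully solved.

```python
# Hypothetical sketch of an average-pass-rate computation over contest
# problems. Names and data here are illustrative, not OJBench's actual code.
from dataclasses import dataclass

@dataclass
class ProblemResult:
    problem_id: str
    tests_passed: int   # hidden test cases the submission passed
    tests_total: int    # total hidden test cases for the problem

def solved(result: ProblemResult) -> bool:
    # Competitive judging is all-or-nothing: a submission counts as
    # solved only if it passes every hidden test case.
    return result.tests_passed == result.tests_total

def average_pass_rate(results: list[ProblemResult]) -> float:
    # Fraction of benchmark problems the model fully solved.
    if not results:
        return 0.0
    return sum(solved(r) for r in results) / len(results)

# Illustrative results for one model on three problems.
results = [
    ProblemResult("NOI-2019-A", 25, 25),   # fully solved
    ProblemResult("ICPC-2022-F", 30, 31),  # one failing test -> not solved
    ProblemResult("NOI-2021-C", 0, 40),    # unsolved
]
print(f"pass rate: {average_pass_rate(results):.1%}")  # prints "pass rate: 33.3%"
```

Under this all-or-nothing convention, a model that nearly solves every problem can still score near zero, which is part of why competition-level benchmarks like OJBench are so punishing.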