OIBench: Benchmarking Strong Reasoning Models with Olympiad in Informatics

📅 2025-06-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional algorithmic benchmarks are approaching saturation, necessitating more challenging tasks to advance algorithmic reasoning. This paper introduces OIBench, a private benchmark of International Olympiad in Informatics (IOI)-level problems comprising 250 original, human-crafted, and multi-round-validated programming tasks spanning diverse paradigms and computational complexity classes. To ensure rigorous assessment, the authors (1) design the benchmark to be contamination-resistant and verify this experimentally; (2) propose Time/Space Completion Curves for fine-grained analysis of runtime and memory efficiency; and (3) enable direct human-model comparison through evaluations with high-level human participants. Experiments show that state-of-the-art models surpass most human contestants in both correctness and efficiency, yet remain suboptimal relative to the canonical hand-written solutions. The dataset is publicly released on Hugging Face, establishing a standardized, high-difficulty benchmark for evaluating code reasoning capabilities.

📝 Abstract
As models become increasingly sophisticated, conventional algorithm benchmarks are increasingly saturated, underscoring the need for more challenging benchmarks to guide future improvements in algorithmic reasoning. This paper introduces OIBench, a high-quality, private, and challenging olympiad-level informatics dataset comprising 250 carefully curated original problems. We detail the construction methodology of the benchmark, ensuring a comprehensive assessment across various programming paradigms and complexities, and we demonstrate its contamination-resistant properties via experiments. We propose Time/Space Completion Curves for finer-grained efficiency analysis and enable direct human-model comparisons through high-level participant evaluations. Our experiments reveal that while open-source models lag behind closed-source counterparts, current SOTA models already outperform most human participants in both correctness and efficiency, while still being suboptimal compared to the canonical solutions. By releasing OIBench as a fully open-source resource (https://huggingface.co/datasets/AGI-Eval/OIBench), we hope this benchmark will contribute to advancing code reasoning capabilities for future LLMs.
Problem

Research questions and friction points this paper is trying to address.

Conventional algorithm benchmarks are saturating; more challenging benchmarks are needed to advance algorithmic reasoning
How to build a high-quality, contamination-resistant, olympiad-level informatics dataset
How model correctness and efficiency compare to human participants and to canonical solutions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces OIBench, a challenging olympiad-level informatics dataset
Proposes Time/Space Completion Curves for efficiency analysis
Enables direct human-model comparisons via participant evaluations
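The Time/Space Completion Curve idea can be illustrated with a small sketch. The paper does not spell out its exact formulation here, so the following is an assumption: for each budget multiplier relative to the canonical solution's resource usage, we record the fraction of test cases a model's solution completes within that budget. All function and variable names are hypothetical.

```python
# Hypothetical sketch of a (Time) Completion Curve, assuming the curve plots
# the fraction of test cases a solution finishes within t * (canonical cost)
# as the budget multiplier t grows. The same shape would apply to memory.

def completion_curve(model_times, canonical_times, multipliers):
    """Fraction of test cases completed within each relative time budget."""
    assert len(model_times) == len(canonical_times)
    curve = []
    for t in multipliers:
        done = sum(1 for m, c in zip(model_times, canonical_times) if m <= t * c)
        curve.append(done / len(model_times))
    return curve

# Example: a model that is about 2x slower than canonical on most cases.
model = [1.0, 2.0, 4.0, 8.0]   # seconds per test case (model solution)
canon = [1.0, 1.0, 2.0, 2.0]   # seconds per test case (canonical solution)
print(completion_curve(model, canon, [1.0, 2.0, 4.0]))  # [0.25, 0.75, 1.0]
```

A curve like this is finer-grained than a single pass/fail time limit: two solutions with the same pass rate can differ sharply in how quickly their curves reach 1.0.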