🤖 AI Summary
Traditional algorithmic benchmarks are approaching saturation, necessitating more challenging evaluation tasks to advance algorithmic reasoning. This paper introduces OIBench—the first private benchmark targeting International Olympiad in Informatics (IOI)-level problems—comprising 250 original, human-crafted, and multi-round-validated programming tasks spanning diverse paradigms and computational complexity classes. To ensure rigorous assessment, we propose three key innovations: (1) contamination-immune benchmark design to prevent data leakage; (2) execution-trace-based quantification of time/space completion curves; and (3) a human-AI collaborative empirical evaluation framework. Experiments show that state-of-the-art models surpass most human contestants in both correctness and efficiency, yet remain inferior to optimal hand-written solutions. The dataset is publicly released on Hugging Face, establishing a standardized, high-difficulty benchmark for evaluating code reasoning capabilities.
📝 Abstract
As models become increasingly sophisticated, conventional algorithm benchmarks are approaching saturation, underscoring the need for more challenging benchmarks to guide future improvements in algorithmic reasoning. This paper introduces OIBench, a high-quality, private, and challenging olympiad-level informatics dataset comprising 250 carefully curated original problems. We detail the benchmark's construction methodology, which ensures comprehensive coverage of diverse programming paradigms and complexity classes, and we demonstrate its contamination-resistant properties via experiments. We propose Time/Space Completion Curves for finer-grained efficiency analysis and enable direct human-model comparisons through evaluations with high-level human participants. Our experiments reveal that while open-source models lag behind their closed-source counterparts, current SOTA models already outperform most human participants in both correctness and efficiency, though they remain suboptimal relative to the canonical solutions. By releasing OIBench as a fully open-source resource (https://huggingface.co/datasets/AGI-Eval/OIBench), we hope this benchmark will contribute to advancing the code reasoning capabilities of future LLMs.
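As one illustration of the completion-curve idea, a cumulative curve can be built from per-problem runtime ratios between a model's solution and the canonical solution. This is a minimal, hypothetical sketch; the function name, inputs, and exact curve definition are assumptions for illustration, not OIBench's actual specification.

```python
# Hypothetical sketch of a Time Completion Curve: for each problem, take the
# ratio of a model solution's runtime to the canonical solution's runtime,
# then report the fraction of problems falling under each ratio threshold.
# (A Space Completion Curve would use peak memory instead of runtime.)

def completion_curve(model_times, canonical_times, thresholds):
    """Fraction of problems whose runtime ratio stays within each threshold."""
    ratios = [m / c for m, c in zip(model_times, canonical_times)]
    n = len(ratios)
    return [sum(r <= t for r in ratios) / n for t in thresholds]

# Example: three problems where the model is 2x, 1x, and 4x slower
# than the canonical solution; thresholds of 1x, 2x, and 4x.
curve = completion_curve([2.0, 1.0, 4.0], [1.0, 1.0, 1.0], [1, 2, 4])
print(curve)  # fractions of problems within 1x, 2x, 4x of canonical runtime
```

A curve that rises quickly toward 1.0 at low thresholds indicates solutions whose efficiency is close to the canonical ones, which is the kind of finer-grained distinction a binary pass/fail metric cannot capture.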