Scaling Test-Time Compute to Achieve IOI Gold Medal with Open-Weight Models

📅 2025-10-15

🤖 AI Summary

Problem: Open-weight large language models (LLMs) have so far fallen short of gold-medal performance on the International Olympiad in Informatics (IOI). Method: We propose GenCluster, a test-time compute framework that enables the open-weight model gpt-oss-120b to reach gold-medal performance on IOI 2025. GenCluster combines large-scale candidate generation, behavioral clustering to compress the solution space, ranking, and a round-robin submission strategy, all under a limited validation budget. Contribution/Results: Experiments show that performance scales consistently with available test-time compute, narrowing the gap between open and closed systems. The work sets a new benchmark for transparent and reproducible evaluation of reasoning in LLMs.

📝 Abstract

Competitive programming has become a rigorous benchmark for evaluating the reasoning and problem-solving capabilities of large language models (LLMs). The International Olympiad in Informatics (IOI) stands out as one of the most prestigious annual competitions in competitive programming and has become a key benchmark for comparing human and AI programming ability. While several proprietary models have been claimed to achieve gold-medal-level performance at the IOI, often with undisclosed methods, achieving comparable results with open-weight models remains a significant challenge. In this paper, we present GenCluster, a scalable and reproducible test-time compute framework that attains IOI gold-level performance using open-weight models. It combines large-scale generation, behavioral clustering, ranking, and a round-robin submission strategy to efficiently explore diverse solution spaces under limited validation budgets. Our experiments show that the performance of the proposed approach scales consistently with available compute, narrowing the gap between open and closed systems. Notably, we show that GenCluster can achieve a gold medal at IOI 2025 for the first time with an open-weight model, gpt-oss-120b, setting a new benchmark for transparent and reproducible evaluation of reasoning in LLMs.
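To make the idea of behavioral clustering concrete, here is a minimal Python sketch. It is not the paper's implementation: the abstract does not say how behavior is measured, so this sketch assumes that two candidates belong to the same cluster when they produce identical outputs on a shared set of test inputs. The names `behavior_signature` and `cluster_by_behavior`, the toy candidates, and the largest-cluster-first ordering are all hypothetical and chosen for illustration only.

```python
from collections import defaultdict
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-in for a generated program: a function from input to output.
Solution = Callable[[int], int]


def behavior_signature(solution: Solution, test_inputs: Sequence[int]) -> Tuple[str, ...]:
    """Run a candidate on every test input and record what it does."""
    outputs = []
    for x in test_inputs:
        try:
            outputs.append(repr(solution(x)))
        except Exception:
            # A crash is also a behavior; record it instead of aborting.
            outputs.append("<error>")
    return tuple(outputs)


def cluster_by_behavior(solutions: Sequence[Solution],
                        test_inputs: Sequence[int]) -> List[List[int]]:
    """Group candidate indices whose signatures match exactly."""
    clusters = defaultdict(list)
    for idx, sol in enumerate(solutions):
        clusters[behavior_signature(sol, test_inputs)].append(idx)
    # Assumed ordering (not from the paper): largest cluster first,
    # ties broken by the smallest member index for determinism.
    return sorted(clusters.values(), key=lambda c: (-len(c), c[0]))


# Toy candidates: three equivalent "double x" programs, one different, one that crashes.
candidates = [
    lambda x: x * 2,
    lambda x: x + x,
    lambda x: x ** 2,
    lambda x: 2 * x,
    lambda x: 1 // (x - 3),
]
clusters = cluster_by_behavior(candidates, test_inputs=[0, 1, 2, 3])
print(clusters)  # [[0, 1, 3], [2], [4]]
```

Five candidates collapse into three behaviors, so only three representatives need to be considered for ranking. That is the sense in which clustering compresses a large pool of generations.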

Problem

Research questions and friction points this paper is trying to address.

Achieving IOI gold medal performance with open-weight models

Scaling test-time compute to explore diverse solution spaces

Narrowing performance gap between open and closed AI systems

Innovation

Methods, ideas, or system contributions that make the work stand out.

Scalable test-time compute framework using open-weight models

Combines generation, clustering, ranking, and round-robin submission

Efficiently explores diverse solutions under limited validation budgets
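
A round-robin submission strategy under a limited validation budget might look roughly like the following Python sketch. The summary above does not give the paper's procedure, so everything here is an assumption: the function `round_robin_submit`, the `budget` parameter, the `score_fn` grader callback, and the toy scores are hypothetical. The sketch assumes clusters arrive already ranked and takes one candidate from each cluster in turn until the budget is spent.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def round_robin_submit(
    ranked_clusters: Sequence[Sequence[str]],
    score_fn: Callable[[str], float],
    budget: int,
) -> Tuple[Optional[str], Optional[float], List[str]]:
    """Cycle through clusters, submitting one candidate per cluster per pass.

    Stops when the submission budget is used up or all clusters are empty.
    Returns the best candidate, its score, and the submission order.
    """
    queues = [list(cluster) for cluster in ranked_clusters]  # copy so inputs stay untouched
    best: Optional[str] = None
    best_score: Optional[float] = None
    submitted: List[str] = []

    while len(submitted) < budget and any(queues):
        for queue in queues:
            if len(submitted) >= budget:
                break
            if not queue:
                continue
            candidate = queue.pop(0)
            score = score_fn(candidate)  # one grader call consumes one unit of budget
            submitted.append(candidate)
            if best_score is None or score > best_score:
                best, best_score = candidate, score
    return best, best_score, submitted


# Toy example: the grader's scores are hidden from the strategy until submission.
hidden_scores = {"a1": 30, "a2": 35, "b1": 100, "c1": 10, "c2": 0}
ranked = [["a1", "a2"], ["b1"], ["c1", "c2"]]
best, best_score, order = round_robin_submit(ranked, hidden_scores.__getitem__, budget=4)
print(order)             # ['a1', 'b1', 'c1', 'a2']
print(best, best_score)  # b1 100
```

In this toy run the top-ranked cluster is not the best one, yet the second submission already reaches the strongest candidate. Spreading submissions across clusters in this way guards against an imperfect ranking, at the cost of spending some budget on weaker clusters.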
