OpenSIR: Open-Ended Self-Improving Reasoner

📅 2025-11-01
🤖 AI Summary
Reinforcement learning for LLM reasoning currently relies on annotated datasets for verifiable rewards, which may limit models' ability to surpass human-level performance. Self-play is a promising alternative, but existing approaches depend on external verifiers or cannot learn open-endedly. This paper introduces OpenSIR, a self-play reasoning framework that requires no external supervision. In OpenSIR, a model alternates between teacher and student roles to generate and solve novel mathematical problems, and it explores open-endedly by rewarding problems that are appropriately difficult and that cover distinct concepts. Its core contribution is a closed-loop self-improvement mechanism that sustains gains in reasoning without external verification. In experiments, Llama-3.2-3B-Instruct improves by 4.4 points on GSM8K (73.9 to 78.3) and by 5.6 points on College Math (28.8 to 34.4), and Gemma-2-2B-Instruct gains 20.2 points on GSM8K (38.5 to 58.7).

📝 Abstract
Recent advances in large language model (LLM) reasoning through reinforcement learning rely on annotated datasets for verifiable rewards, which may limit models' ability to surpass human-level performance. While self-play offers a promising alternative, existing approaches depend on external verifiers or cannot learn open-endedly. We present Open-Ended Self-Improving Reasoner (OpenSIR), a self-play framework where an LLM learns to generate and solve novel problems by alternating teacher and student roles without external supervision. To generate novel problems, OpenSIR optimises for both difficulty and diversity, rewarding problems that challenge appropriately while exploring distinct concepts, enabling open-ended mathematical discovery. Starting from a single trivial seed problem, OpenSIR substantially improves instruction models: Llama-3.2-3B-Instruct advances from 73.9 to 78.3 on GSM8K, and from 28.8 to 34.4 on College Math, while Gemma-2-2B-Instruct rises from 38.5 to 58.7 on GSM8K. Our analyses reveal that OpenSIR achieves open-ended learning through co-evolving teacher-student roles that adaptively calibrate difficulty and drive diverse exploration, progressing autonomously from basic to advanced mathematics.
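The abstract says problem generation is rewarded for both difficulty and diversity, but it does not give the reward formula. The following is a minimal sketch of one way such a reward could be composed. Every function name, the majority-agreement proxy for solve rate, the target rate of 0.5, the token-overlap similarity, and the weight `alpha` are assumptions made for illustration, not the paper's definitions.

```python
from collections import Counter


def solve_rate(answers):
    """Fraction of sampled student answers that match the majority answer.

    Assumption: with no external verifier, agreement among samples is
    used as a stand-in for how often the student solves the problem.
    """
    if not answers:
        return 0.0
    _, count = Counter(answers).most_common(1)[0]
    return count / len(answers)


def difficulty_reward(rate, target=0.5, width=0.5):
    """1.0 when rate equals target, falling linearly to 0.0 at distance width."""
    return max(0.0, 1.0 - abs(rate - target) / width)


def jaccard(a, b):
    """Jaccard similarity of two token sets (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def diversity_reward(problem, pool):
    """1 minus the highest token overlap with any earlier problem."""
    if not pool:
        return 1.0
    tokens = set(problem.lower().split())
    return 1.0 - max(jaccard(tokens, set(p.lower().split())) for p in pool)


def teacher_reward(problem, answers, pool, alpha=0.5):
    """Weighted sum of the difficulty and diversity terms (alpha is hypothetical)."""
    difficulty = difficulty_reward(solve_rate(answers))
    diversity = diversity_reward(problem, pool)
    return alpha * difficulty + (1 - alpha) * diversity
```

Under these assumptions, a problem that the student answers consistently about half the time and that shares few tokens with earlier problems scores highest, while trivial, unsolvable, or near-duplicate problems score low.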
Problem

Research questions and friction points this paper is trying to address.

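Reinforcement learning for LLM reasoning depends on annotated datasets for verifiable rewards, which may cap performance at human level
Existing self-play approaches depend on external verifiers or cannot learn open-endedly
Generating novel problems requires balancing appropriate difficulty against diversity of concepts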
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-play framework alternates teacher and student roles
Optimizes problem generation for difficulty and diversity
Enables open-ended learning without external supervision
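
The alternation between teacher and student roles can be pictured as a loop over a growing problem pool. The sketch below is illustrative only: `self_play_round`, the agreement band, and the toy teacher and student are hypothetical stand-ins for a single LLM playing both roles, and the reinforcement learning update that the paper relies on is omitted.

```python
from collections import Counter
from itertools import cycle


def self_play_round(generate, solve, pool, n_problems=4, n_samples=8,
                    band=(0.25, 0.75)):
    """One round: the teacher proposes problems, the student attempts them.

    Assumption: a problem is kept when the share of student samples that
    agree with the majority answer falls inside `band`, i.e. it looks
    neither trivial nor hopeless. No policy update is performed here.
    """
    accepted = []
    for _ in range(n_problems):
        problem = generate(pool)                              # teacher role
        answers = [solve(problem) for _ in range(n_samples)]  # student role
        top_count = Counter(answers).most_common(1)[0][1]
        agreement = top_count / n_samples
        if band[0] <= agreement <= band[1] and problem not in pool:
            accepted.append(problem)
            pool.append(problem)
    return accepted


# Toy stand-ins so the loop runs without a model.
def toy_teacher(pool):
    n = len(pool) + 1
    return f"What is {n} + {n}?"


_fake_answers = cycle(["2", "2", "3", "4"])


def toy_student(problem):
    return next(_fake_answers)


pool = ["What is 1 + 1?"]  # a single trivial seed problem
accepted = self_play_round(toy_teacher, toy_student, pool)
```

In a real setup both callables would wrap the same model under different prompts, and the accepted problems and their solutions would feed the training step before the next round.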