Towards Understanding Self-play for LLM Reasoning

📅 2025-10-30
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This study investigates the intrinsic mechanisms by which self-play enhances mathematical reasoning in large language models (LLMs), systematically contrasting it with supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR). Method: Building on the Absolute Zero Reasoner framework, the authors train on self-generated problems with multi-strategy proposers and reward designs, evaluate with pass@k, and analyze token-level entropy dynamics and parameter update sparsity. Contribution/Results: The analysis shows that self-play improves generalization through infrequent yet high-information parameter updates and a substantial reduction in output-distribution entropy; these gains hold both within and across domains. Unlike SFT (which relies on external supervision) or RLVR (which depends on scalar rewards), self-play critically hinges on high-quality self-generated feedback. The findings establish an interpretable, quantifiable paradigm for reasoning optimization, supported by empirical evidence on parameter dynamics, entropy evolution, and cross-domain transferability.
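The pass@k metric mentioned above is commonly computed with the unbiased estimator from the Codex evaluation setup: given n sampled solutions per problem of which c are correct, estimate the probability that at least one of k draws succeeds. A minimal sketch (the function name is illustrative; the paper does not specify its implementation):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples, drawn without replacement from n attempts (c correct),
    solves the problem. Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer incorrect samples than k draws: a correct one is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n=2 attempts, c=1 correct, pass@1 is 0.5, matching the empirical success rate of a single draw.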

๐Ÿ“ Abstract
Recent advances in large language model (LLM) reasoning, led by reinforcement learning with verifiable rewards (RLVR), have inspired self-play post-training, where models improve by generating and solving their own problems. While self-play has shown strong in-domain and out-of-domain gains, the mechanisms behind these improvements remain poorly understood. In this work, we analyze the training dynamics of self-play through the lens of the Absolute Zero Reasoner, comparing it against RLVR and supervised fine-tuning (SFT). Our study examines parameter update sparsity, entropy dynamics of token distributions, and alternative proposer reward functions. We further connect these dynamics to reasoning performance using pass@k evaluations. Together, our findings clarify how self-play differs from other post-training strategies, highlight its inherent limitations, and point toward future directions for improving LLM math reasoning through self-play.
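The entropy dynamics the abstract refers to are typically measured as the Shannon entropy of the model's next-token distribution at each position. A minimal, numerically stable sketch in plain Python (operating on a single position's logits; the paper's exact aggregation over tokens and training steps is not specified here):

```python
import math

def token_entropy(logits):
    """Shannon entropy (in nats) of one token's output distribution,
    computed from raw logits via a max-shifted softmax for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)
```

A uniform two-way distribution gives ln 2 ≈ 0.693 nats, while a sharply peaked distribution approaches 0; a falling average entropy over training indicates the output distribution is concentrating.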
Problem

Research questions and friction points this paper is trying to address.

Analyzing training dynamics of self-play in LLM reasoning
Comparing self-play against RLVR and supervised fine-tuning methods
Understanding mechanisms behind self-play improvements and limitations
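One mechanism the paper examines, parameter update sparsity, can be illustrated as the fraction of parameters left numerically unchanged by a training run. This is a sketch under an assumed definition (fixed absolute tolerance over flattened parameters); the paper may define sparsity differently:

```python
def update_sparsity(before, after, tol=1e-8):
    """Fraction of parameters whose value changed by no more than `tol`
    between two checkpoints: a simple proxy for update sparsity."""
    assert len(before) == len(after), "checkpoints must align"
    unchanged = sum(1 for b, a in zip(before, after) if abs(b - a) <= tol)
    return unchanged / len(before)
```

For instance, if only one of four parameters moves between checkpoints, the sparsity is 0.75.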
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-play post-training improves LLM reasoning
Analyzing training dynamics using Absolute Zero Reasoner
Comparing self-play with RLVR and SFT methods