SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning

📅 2025-06-30
🤖 AI Summary
This work addresses the reliance of language model reasoning on manually annotated data and domain-specific reward signals. It proposes SPIRAL, an unsupervised self-play reinforcement learning framework in which agents play multi-turn, zero-sum games (e.g., Kuhn Poker) against continuously improving copies of themselves, autonomously generating progressively harder tasks and thereby an endlessly scalable, self-generated curriculum. A role-conditioned advantage estimation (RAE) mechanism stabilizes the fully online, multi-agent training, and analysis attributes the observed transfer to three cognitive patterns: systematic decomposition, expected value calculation, and case-by-case analysis. On Qwen3-4B-Base, training on a single game improves mathematical and general reasoning by 8.6% and 8.4%, respectively, surpassing supervised fine-tuning on expert game trajectories. Multi-game joint training further enhances performance, and applying SPIRAL to the stronger DeepSeek-R1-Distill-Qwen-7B still yields a 2.0% average gain, demonstrating the cross-domain generalization and scalability of the learned reasoning capabilities.

📝 Abstract
Recent advances in reinforcement learning have shown that language models can develop sophisticated reasoning through training on tasks with verifiable rewards, but these approaches depend on human-curated problem-answer pairs and domain-specific reward engineering. We introduce SPIRAL, a self-play framework where models learn by playing multi-turn, zero-sum games against continuously improving versions of themselves, eliminating the need for human supervision. Through self-play, SPIRAL generates an infinite curriculum of progressively challenging problems as models must constantly adapt to stronger opponents. To enable this self-play training at scale, we implement a fully online, multi-turn, multi-agent reinforcement learning system for LLMs and propose role-conditioned advantage estimation (RAE) to stabilize multi-agent training. Using SPIRAL, self-play on zero-sum games produces reasoning capabilities that transfer broadly. Training Qwen3-4B-Base on Kuhn Poker alone achieves 8.6% improvement on math and 8.4% on general reasoning, outperforming SFT on 25,000 expert game trajectories. Analysis reveals that this transfer occurs through three cognitive patterns: systematic decomposition, expected value calculation, and case-by-case analysis. Multi-game training (TicTacToe, Kuhn Poker, Simple Negotiation) further enhances performance as each game develops distinct reasoning strengths. Applying SPIRAL to a strong reasoning model (DeepSeek-R1-Distill-Qwen-7B) can still lead to 2.0% average improvement. These results demonstrate that zero-sum games naturally develop transferable reasoning capabilities, highlighting a promising direction for autonomous reasoning development.
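The core loop the abstract describes — one policy playing both roles of a zero-sum game against itself, with each episode contributing per-role trajectories for RL updates — can be sketched minimally as follows. This is an illustrative stand-in, not the paper's implementation: matching pennies replaces the paper's games (Kuhn Poker, TicTacToe, Simple Negotiation), and `policy` is a hypothetical stub for the LLM's action sampler.

```python
import random

def policy(observation, role):
    # Placeholder for the LLM's action distribution; in SPIRAL both
    # roles are sampled from the same continuously updated model.
    return random.choice(["heads", "tails"])

def play_matching_pennies(policy):
    """One self-play episode of a trivial zero-sum game."""
    a0 = policy("your move", role=0)
    a1 = policy("your move", role=1)
    r0 = 1.0 if a0 == a1 else -1.0  # player 0 wins on a match
    # Zero-sum structure: the two roles' rewards always negate.
    return [(0, a0, r0), (1, a1, -r0)]

def collect(policy, episodes=4):
    """Gather (role, action, return) tuples for an RL update batch."""
    batch = []
    for _ in range(episodes):
        batch.extend(play_matching_pennies(policy))
    return batch
```

Because the opponent is the current policy itself, any improvement immediately hardens the opposition, which is the mechanism behind the "infinite curriculum" claim.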
Problem

Research questions and friction points this paper is trying to address.

Eliminates need for human-curated data in reasoning training
Enables autonomous learning via multi-agent self-play games
Develops transferable reasoning skills through zero-sum games
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-play framework for autonomous reasoning development
Online multi-turn multi-agent reinforcement learning system
Role-conditioned advantage estimation stabilizes training
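The role-conditioned advantage estimation (RAE) idea named above can be sketched as keeping a separate reward baseline per (game, role), so that each role's advantages are centered independently rather than against a pooled average. The exponential-moving-average baseline and its decay rate here are assumptions for illustration, not details taken from the paper.

```python
from collections import defaultdict

class RoleConditionedAdvantage:
    """Sketch of role-conditioned advantage estimation: one running
    baseline per (game, role), used to center that role's returns.
    Decay rate and the EMA form are assumptions of this sketch."""

    def __init__(self, decay=0.95):
        self.decay = decay
        self.baseline = defaultdict(float)  # (game, role) -> EMA of returns

    def advantage(self, game, role, ret):
        b = self.baseline[(game, role)]
        # Update this role's baseline, then center the return against
        # the pre-update value.
        self.baseline[(game, role)] = self.decay * b + (1 - self.decay) * ret
        return ret - b
```

Conditioning on role matters in asymmetric games (e.g., first mover vs. second mover in Kuhn Poker), where the two roles have systematically different expected returns that a single shared baseline would blur.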