$π$-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data

📅 2026-04-15
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work proposes π-Play, a multi-agent self-evolution framework for training deep search agents on complex information-seeking tasks, where sparse rewards, weak credit assignment, and scarce labeled data make training difficult. π-Play uses question construction paths (QCPs), which arise naturally when the examiner generates tasks during self-play, as privileged context for a teacher model that densely supervises a student through self-distillation. This turns conventional sparse-reward self-play into a dense-feedback self-evolution loop. Without human feedback or external data, π-Play surpasses fully supervised search agents and improves evolutionary efficiency by 2–3× over conventional self-play.

📝 Abstract
Deep search agents have emerged as a promising paradigm for addressing complex information-seeking tasks, but their training remains challenging due to sparse rewards, weak credit assignment, and limited labeled data. Self-play offers a scalable route to reduce data dependence, but conventional self-play optimizes students only through sparse outcome rewards, leading to low learning efficiency. In this work, we observe that self-play naturally produces a question construction path (QCP) during task generation, an intermediate artifact that captures the reverse solution process. This reveals a new source of privileged information for self-distillation: self-play can itself provide high-quality privileged context for the teacher model in a low-cost and scalable manner, without relying on human feedback or curated privileged information. Leveraging this insight, we propose Privileged Information Self-Play ($π$-Play), a multi-agent self-evolution framework. In $π$-Play, an examiner generates tasks together with their QCPs, and a teacher model leverages QCP as privileged context to densely supervise a student via self-distillation. This design transforms conventional sparse-reward self-play into a dense-feedback self-evolution loop. Extensive experiments show that data-free $π$-Play surpasses fully supervised search agents and improves evolutionary efficiency by 2-3$\times$ over conventional self-play.
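
The contrast between sparse outcome rewards and dense teacher supervision can be illustrated with a small sketch. The abstract does not state which distillation loss $π$-Play uses, so the per-token KL divergence below is an assumption (a common choice for self-distillation), and every function name and toy logit value is hypothetical rather than taken from the paper.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one token position."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def dense_distillation_signal(
    teacher_logits: Sequence[Sequence[float]],
    student_logits: Sequence[Sequence[float]],
) -> List[float]:
    """One loss term per token position.

    Assumption: the teacher logits come from the same model conditioned on
    the QCP as privileged context; the student logits come from the model
    without it.
    """
    return [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]


def sparse_outcome_signal(num_tokens: int, answer_correct: bool) -> List[float]:
    """Conventional self-play: a single reward placed on the final token."""
    return [0.0] * (num_tokens - 1) + [1.0 if answer_correct else 0.0]


# Toy trajectory with 3 token positions and a 2-word vocabulary (made-up values).
teacher = [[2.0, 0.0], [0.0, 1.5], [0.0, 0.0]]
student = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]

dense = dense_distillation_signal(teacher, student)
sparse = sparse_outcome_signal(num_tokens=3, answer_correct=True)
```

In this toy setup `dense` carries a separate value at each of the three positions, nonzero wherever the QCP-conditioned teacher disagrees with the student, whereas `sparse` is zero everywhere except the last position. That is the sense in which privileged self-distillation gives denser feedback than an outcome reward alone.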
Problem

Research questions and friction points this paper is trying to address.

self-play
sparse rewards
credit assignment
labeled data
multi-agent
Innovation

Methods, ideas, or system contributions that make the work stand out.

self-play
privileged information
self-distillation
multi-agent
question construction path