🤖 AI Summary
This work addresses two limitations of self-play training for self-evolving search agents: in early training the Proposer often generates invalid or unverifiable questions because it lacks relational context, and the Solver receives only a binary outcome reward that discards useful signal from partially correct search trajectories. To overcome these limitations, the authors propose a lightweight approach that requires no additional task-specific human annotation and grounds question construction in LLM-guided subgraphs extracted from knowledge graphs. They also introduce the Waypoint Coverage Reward (WCR), which gives graded partial credit to incorrect trajectories according to how many intermediate entities on the question's construction path they cover, while correct answers keep full reward. The same knowledge-graph paths are thus reused for both relation-aware question generation and reward shaping. Experiments across seven question-answering benchmarks and nine model configurations show that the approach improves the average score over standard Search Self-Play (SSP) in every configuration, with notable gains on multi-hop reasoning tasks.
📝 Abstract
Self-evolving search agents reduce reliance on human-written training questions by generating and solving their own search tasks. We build on Search Self-Play (SSP), a representative Proposer and Solver framework in which questions are generated and answered via multi-step search and reasoning. In practice, however, SSP faces two bottlenecks: the Proposer constructs questions from isolated answer entities without relational context, yielding many invalid or unverifiable questions in early self-play training, while the Solver receives only a binary outcome reward that discards useful signal from partially on-track search trajectories. We address both bottlenecks by reusing knowledge-graph paths as construction-derived intermediate supervision for both question construction and reward shaping. First, we ground question construction in LLM-guided knowledge-graph subgraphs, providing relational context for the Proposer. Second, we observe that constructing and solving a multi-hop question can involve overlapping intermediate entities: the factual bridges used to formulate the question may provide approximate waypoints for answering it. Exploiting this overlap, we introduce Waypoint Coverage Reward (WCR), which grants graded partial credit to incorrect Solver trajectories according to their coverage of entities on the construction path, while preserving full reward for correct answers. Across seven QA benchmarks and nine model configurations, our approach improves the average score over standard SSP in all configurations, including notable gains on multi-hop QA tasks. These results suggest that knowledge-graph paths can be reused as lightweight intermediate supervision, providing both relational guidance and process feedback without additional task-specific human annotations or manually labeled process steps.
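To make the reward rule concrete, the following is a minimal Python sketch of a waypoint-coverage-style reward, not the authors' implementation. The abstract specifies only that correct answers keep full reward and that incorrect trajectories earn graded partial credit based on their coverage of construction-path entities. Everything else here is an assumption for illustration: the function names, the case-insensitive substring match used to detect an entity, and the `partial_cap` scale that keeps partial credit below the full reward.

```python
from typing import Sequence


def waypoint_coverage(trajectory_text: str, waypoints: Sequence[str]) -> float:
    """Fraction of distinct construction-path entities mentioned in the trajectory.

    Assumption: an entity counts as covered if its name appears as a
    case-insensitive substring of the trajectory text.
    """
    unique = list(dict.fromkeys(w.casefold() for w in waypoints if w.strip()))
    if not unique:
        return 0.0
    text = trajectory_text.casefold()
    hits = sum(1 for w in unique if w in text)
    return hits / len(unique)


def wcr_reward(
    is_correct: bool,
    trajectory_text: str,
    waypoints: Sequence[str],
    partial_cap: float = 0.5,  # hypothetical scale, not taken from the paper
) -> float:
    """Full reward for a correct answer, graded partial credit otherwise."""
    if is_correct:
        return 1.0
    return partial_cap * waypoint_coverage(trajectory_text, waypoints)


if __name__ == "__main__":
    path_entities = ["Marie Curie", "Warsaw", "Poland"]
    trajectory = "Search: Marie Curie birthplace. Result: she was born in Warsaw."
    print(wcr_reward(False, trajectory, path_entities))
    print(wcr_reward(True, trajectory, path_entities))
```

Capping partial credit below 1.0 is one way to ensure that an incorrect trajectory never scores as high as a correct answer, which matches the stated goal of preserving full reward for correct answers. The actual weighting and entity-matching procedure may differ.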