🤖 AI Summary
This work addresses a limitation of self-evolving large language models in open-ended generation tasks: reliance on external evaluators often leads to reward hacking and hinders autonomous, continuous improvement. To overcome this, the authors propose G-Zero, a framework in which a Generator and a Proposer model co-evolve without any external supervision. The framework introduces Hint-δ, an intrinsic reward derived from the shift in the Generator's predictions induced by a self-generated hint. The Proposer is trained via GRPO to produce challenging queries and informative hints, while the Generator is optimized via DPO on these dynamically generated internal supervision signals. Theoretical analysis establishes a best-iterate suboptimality guarantee for an idealized standard-DPO version of the framework, provided that the Proposer induces sufficient exploration coverage and that pseudo-label noise stays low, offering a scalable pathway toward data-free self-evolution of large language models in open domains.
📝 Abstract
Self-evolving LLMs excel in verifiable domains but struggle in open-ended tasks, where reliance on proxy LLM judges introduces capability bottlenecks and reward hacking. To overcome this, we introduce G-Zero, a verifier-free, co-evolutionary framework for autonomous self-improvement. Our core innovation is Hint-$\delta$, an intrinsic reward that quantifies the predictive shift between a Generator model's unassisted response and its response conditioned on a self-generated hint. Using this signal, a Proposer model is trained via GRPO to continuously target the Generator's blind spots by synthesizing challenging queries and informative hints. The Generator is concurrently optimized via DPO to internalize these hint-guided improvements. Theoretically, we prove a best-iterate suboptimality guarantee for an idealized standard-DPO version of G-Zero, provided that the Proposer induces sufficient exploration coverage and that data filtering keeps pseudo-label score noise low. By deriving supervision entirely from internal distributional dynamics, G-Zero bypasses the capability ceilings of external judges, providing a scalable, robust pathway for continuous LLM self-evolution across unverifiable domains.
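The three signals named in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation. The abstract does not give the exact definition of Hint-$\delta$, so `hint_delta` below assumes one plausible form: the mean per-token log-probability gain of a response when the hint is in context. Treating the hint-guided response as "chosen" and the unassisted response as "rejected" in DPO is likewise an assumption. The group-normalized advantage and the DPO loss follow their standard published forms. All function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def hint_delta(logp_with_hint, logp_without_hint):
    """ASSUMED form of Hint-delta (not specified in the abstract):
    mean per-token log-prob gain of the same response tokens when the
    Generator is conditioned on the hint versus when it is not."""
    if len(logp_with_hint) != len(logp_without_hint):
        raise ValueError("token log-prob lists must have equal length")
    return mean(w - wo for w, wo in zip(logp_with_hint, logp_without_hint))


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages as in GRPO: each reward is centered and
    scaled by its group's statistics. Population std is an assumption."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def _softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one pair, given sequence log-probs under the
    policy and the frozen reference model: -log sigmoid(beta * margin)."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return _softplus(-beta * margin)


if __name__ == "__main__":
    # Toy numbers, purely illustrative.
    delta = hint_delta([-1.0, -2.0], [-2.0, -4.0])
    print("Hint-delta (assumed form):", delta)
    print("Proposer advantages:", grpo_advantages([0.2, 1.5, 0.7]))
    # Assumption: chosen = hint-guided response, rejected = unassisted one.
    print("Generator DPO loss:", dpo_loss(-10.0, -14.0, -12.0, -12.0))
```

In this sketch, a larger `hint_delta` would mean the hint shifted the Generator's predictions more, which is the kind of query the Proposer is rewarded for finding; how the real system filters pairs before DPO is not described in the abstract and is left out here.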