🤖 AI Summary
This work addresses the challenge of post-training large language models (LLMs) in the absence of ground-truth labels. We propose the "reasoning-as-supervision" paradigm: the current policy generates diverse reasoning trajectories via parallel rollouts; a frozen anchor model consolidates these multi-path outputs into a single synthesized reference; and a reference-free reward is constructed in two regimes, using programmatic equivalence checks on final answers for verifiable tasks and independent LLM-based rubric judging for non-verifiable ones. Crucially, this approach turns the reasoning process itself into an intrinsic self-supervised signal, eliminating reliance on human annotations or external feedback. The synthesized reference can disagree with the majority of rollouts and be correct even when all of them are wrong, and performance scales with the number of rollouts. Evaluated on Gemma, Qwen, and Llama series models, our method achieves up to +27% and +12% absolute gains on MATH-500 and HealthBench, respectively; when integrated with reinforcement learning, improvements reach up to +33% and +30%, substantially surpassing conventional selection-based approaches.
📝 Abstract
Where do learning signals come from when there is no ground truth in post-training? We propose turning exploration into supervision through Compute as Teacher (CaT), which converts the model's own exploration at inference time into reference-free supervision by synthesizing a single reference from a group of parallel rollouts and then optimizing toward it. Concretely, the current policy produces a group of rollouts; a frozen anchor (the initial policy) reconciles omissions and contradictions among them to estimate a reference, turning extra inference-time compute into a teacher signal. We turn this into rewards in two regimes: (i) verifiable tasks use programmatic equivalence on final answers; (ii) non-verifiable tasks use self-proposed rubrics: binary, auditable criteria scored by an independent LLM judge, with reward given by the fraction satisfied. Unlike selection methods (best-of-N, majority voting, perplexity, or judge scores), synthesis may disagree with the majority and be correct even when all rollouts are wrong; performance scales with the number of rollouts. As a test-time procedure, CaT improves Gemma 3 4B, Qwen 3 4B, and Llama 3.1 8B (up to +27% on MATH-500; +12% on HealthBench). With reinforcement learning (CaT-RL), we obtain further gains (up to +33% and +30%), with the trained policy surpassing the initial teacher signal.
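To make the two reward regimes concrete, here is a minimal sketch of how a per-rollout reward could be computed against the synthesized reference. The function name `cat_reward` and the toy `equivalent` and `judge` callables are hypothetical stand-ins for illustration only; the paper's actual setup uses programmatic answer equivalence and an independent LLM judge, neither of which is reproduced here.

```python
from typing import Callable, Optional, Sequence


def cat_reward(
    rollout_answer: str,
    reference_answer: str,
    verifiable: bool,
    equivalent: Callable[[str, str], bool],
    rubric: Sequence[str] = (),
    judge: Optional[Callable[[str, str], bool]] = None,
) -> float:
    """Reference-free reward sketch for the two regimes described above.

    Verifiable tasks: 1.0 if the rollout's final answer is equivalent to the
    synthesized reference, else 0.0.
    Non-verifiable tasks: the fraction of binary rubric criteria the judge
    marks as satisfied.
    """
    if verifiable:
        return 1.0 if equivalent(rollout_answer, reference_answer) else 0.0
    if not rubric or judge is None:
        raise ValueError("non-verifiable tasks need a rubric and a judge")
    satisfied = sum(judge(rollout_answer, criterion) for criterion in rubric)
    return satisfied / len(rubric)


# Toy usage with stand-in checkers (hypothetical, not the paper's judges):
equivalent = lambda a, b: a.strip() == b.strip()       # exact-match equivalence
judge = lambda answer, criterion: criterion in answer  # substring "rubric" check

print(cat_reward("42", "42", True, equivalent))        # verifiable regime
print(cat_reward("covers a and b", "", False, equivalent,
                 rubric=["a", "b", "z"], judge=judge))  # fraction satisfied: 2/3
```

In the full method this scalar would be fed to the RL optimizer (CaT-RL), so that the policy is pushed toward the anchor-synthesized reference rather than toward any single selected rollout.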