Vector Policy Optimization: Training for Diversity Improves Test-Time Search

📅 2026-05-21
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Large language models optimized with scalar rewards often produce low-entropy response distributions, which leaves them short of the output diversity that inference-time search requires. To address this limitation, this work proposes Vector Policy Optimization (VPO), which extends scalar rewards to vector-valued rewards and explicitly trains policies to generate diverse outputs that span different trade-offs in the reward space. VPO is essentially a drop-in replacement for the GRPO advantage estimator that can use multidimensional task feedback, such as per-test-case correctness in code generation. Across four tasks, VPO matches or beats the strongest scalar RL baselines on both pass@k and best@k, with the gap widening as the search budget grows. In evolutionary search settings, VPO-trained models solve problems that GRPO-trained models cannot solve at all.
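The two search metrics named above can be sketched as follows. This is a minimal illustration, not code from the paper: `pass_at_k` uses the standard unbiased combinatorial estimator (n samples, c of them correct), and `best_at_k` assumes that best@k means the expected highest score among k samples drawn without replacement, which the summary does not define. Both function names are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples is correct, when the k
    samples are drawn without replacement from n generated samples of
    which c are correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k hits a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def best_at_k(scores: list[float], k: int) -> float:
    """Expected maximum score among k samples drawn without replacement
    from `scores` (assumed definition of best@k)."""
    n = len(scores)
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= len(scores)")
    ordered = sorted(scores)
    total = comb(n, k)
    # The i-th smallest score (1-indexed) is the maximum of the draw in
    # comb(i - 1, k - 1) of the comb(n, k) possible subsets.
    return sum(s * comb(i - 1, k - 1) for i, s in enumerate(ordered, start=1)) / total
```

Both quantities are non-decreasing in k, so a model whose samples are more varied has more room to improve as the search budget grows.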
📝 Abstract
Language models must now generalize out of the box to novel environments and work inside inference-scaling search procedures, such as AlphaEvolve, that select rollouts with a variety of task-specific reward functions. Unfortunately, the standard paradigm of LLM post-training optimizes a pre-specified scalar reward, often leading current LLMs to produce low-entropy response distributions and thus to struggle to display the diversity that inference-time search requires. We propose Vector Policy Optimization (VPO), an RL algorithm that explicitly trains policies to anticipate diverse downstream reward functions and to produce diverse solutions. VPO exploits the fact that rewards are often vector-valued in practice, such as per-test-case correctness in code generation, or scores from multiple user personas or reward models. VPO is essentially a drop-in replacement for the GRPO advantage estimator, but it trains the LLM to output a set of solutions in which individual solutions specialize in different trade-offs in the vector reward space. Across four tasks, VPO matches or beats the strongest scalar RL baselines on test-time search metrics (e.g., pass@k and best@k), with the gap widening as the search budget grows. For evolutionary search, VPO models unlock problems that GRPO models cannot solve at all. As test-time search becomes more standardized, optimizing for diversity may need to become the default post-training objective.
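To make the scalar-versus-vector distinction concrete, the sketch below computes a GRPO-style group-normalized advantage and then shows what is lost when per-test-case results are averaged into one number. It is an illustration under stated assumptions, not the VPO estimator, which the abstract does not specify: the population standard deviation, the `eps` constant, and the helper `scalarized_advantages` with its fixed weight vector are all hypothetical choices made for this example.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-normalized advantages: each rollout's reward minus the group
    mean, divided by the group standard deviation (population std assumed)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def scalarized_advantages(
    vector_rewards: list[list[float]], weights: list[float], eps: float = 1e-6
) -> list[float]:
    """Hypothetical helper: collapse each reward vector with one weight
    vector, then apply the scalar estimator. Illustrative only."""
    scalars = [sum(w * r for w, r in zip(weights, vec)) for vec in vector_rewards]
    return grpo_advantages(scalars, eps)


# Three rollouts scored on four test cases (1.0 = pass, 0.0 = fail).
group = [
    [1.0, 1.0, 0.0, 0.0],  # rollout A passes the first two tests
    [0.0, 0.0, 1.0, 1.0],  # rollout B passes the last two tests
    [0.0, 0.0, 0.0, 0.0],  # rollout C passes nothing
]

# Uniform weights reproduce the usual "fraction of tests passed" reward:
# A and B become indistinguishable even though they solve different tests.
uniform = scalarized_advantages(group, [0.25, 0.25, 0.25, 0.25])

# Weights concentrated on the first two tests prefer A over B; the
# opposite weighting would prefer B.
first_half = scalarized_advantages(group, [0.5, 0.5, 0.0, 0.0])
```

Under the uniform weighting, rollouts A and B receive the same advantage, so a scalar objective has no reason to keep both kinds of solution. Different weightings of the same reward vectors rank the rollouts differently, which is the kind of trade-off structure the abstract says VPO trains a set of solutions to cover.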
Problem

Research questions and friction points this paper is trying to address.

diversity
test-time search
reward functions
language models
response distribution
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vector Policy Optimization
diverse response generation
vector-valued rewards
test-time search
reinforcement learning for LLMs