🤖 AI Summary
Existing methods struggle to optimize secondary behavioral preferences that a base model rarely exhibits, such as verbosity or technical expertise, while preserving primary task accuracy, because the behavioral reward is sparse and training is inefficient. This work proposes Vector-Steered Policy Optimization (VSPO), which extends GRPO with a steering vector associated with the target behavior. By sampling rollouts under varying steering intensities, the approach controls behavioral strength during optimization and improves learning efficiency as the policy internalizes the steering vector. The method recasts sparse behavioral control as sampling from steering-induced distributions with tunable intensity, and is supported by both theoretical analysis and experiments. Experiments on benchmarks including MATH and MMLU-Pro show consistent improvements in control over four target behaviors while maintaining or improving primary task accuracy, outperforming baselines such as reward shaping and teacher-trace distillation.
📝 Abstract
Modern language models often need to optimize a primary accuracy objective while also accommodating secondary behavioral preferences, such as verbosity, agreeableness, or the level of technical expertise in their responses. In practice, a base model may exhibit a desired behavior very rarely or not at all. Thus, endowing the model with a target behavior creates a sparse behavioral reward bottleneck. To address such multi-objective problems, we introduce Vector-Steered Policy Optimization (VSPO), which employs a steering vector associated with the target behavior to control the behavior intensity of the generated rollouts. VSPO is obtained by modifying GRPO to sample rollouts with varying steering intensities. This process can be interpreted as an on-policy latent self-distillation procedure in which the model internalizes its steering vector. By varying steering intensities, VSPO upsamples rare behaviors and enriches rollout diversity, which alleviates the sparse reward issue and provably accelerates policy optimization. Through comprehensive theory and experiments, we establish that VSPO has favorable properties compared to vanilla reward shaping and other alternative approaches. Specifically, under a bandit abstraction, VSPO provably achieves better iteration complexity than reward-shaped GRPO when the steering-induced distributions are sufficiently aligned with the target behavior. We evaluate VSPO across multiple reasoning benchmarks, including MATH and MMLU-Pro, for four target behaviors: explanation expertise, confidence expression, robustness to misleading context, and response verbosity. Our results show that VSPO consistently improves control over the target behavior while maintaining or improving task accuracy compared with reward shaping, teacher-trace distillation, and guidance-based baselines.
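The sparse-reward argument can be illustrated with a toy bandit. The sketch below is not the authors' implementation: it assumes a three-arm softmax policy in which steering adds `alpha * STEER` to the logits, a 0/1 behavioral reward on the rare arm, and the usual group-normalized advantage (mean-centered, divided by the group standard deviation). All names and numbers (`LOGITS`, `STEER`, `REWARD`, the intensity schedule) are hypothetical. It only measures how often a rollout group carries a non-zero learning signal, and omits the policy update itself.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_group(logits, steer, alphas, rng):
    """Draw one arm per intensity alpha from softmax(logits + alpha * steer)."""
    arms = []
    for alpha in alphas:
        probs = softmax([l + alpha * s for l, s in zip(logits, steer)])
        arms.append(rng.choices(range(len(logits)), weights=probs)[0])
    return arms


def group_advantages(rewards):
    """Group-normalized advantages; all zeros when every reward is identical."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    if var == 0.0:
        return [0.0] * len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / std for r in rewards]


def informative_fraction(logits, steer, alphas, reward, trials, seed=0):
    """Fraction of rollout groups whose advantages are not all zero."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        arms = sample_group(logits, steer, alphas, rng)
        adv = group_advantages([reward[a] for a in arms])
        hits += any(a != 0.0 for a in adv)
    return hits / trials


# Hypothetical toy setup: arm 2 is the rare target behavior.
LOGITS = [3.0, 3.0, -3.0]   # base policy almost never picks arm 2
STEER = [0.0, 0.0, 1.0]     # steering direction aligned with arm 2
REWARD = [0.0, 0.0, 1.0]    # behavioral reward only on arm 2

# Group of 8 rollouts: unsteered vs. a spread of steering intensities.
plain = informative_fraction(LOGITS, STEER, [0.0] * 8, REWARD, trials=500)
steered = informative_fraction(
    LOGITS, STEER, [0.0, 0.0, 2.0, 2.0, 4.0, 4.0, 6.0, 8.0], REWARD, trials=500
)

print(f"groups with non-zero advantages, unsteered: {plain:.3f}")
print(f"groups with non-zero advantages, steered:   {steered:.3f}")
```

In this toy setting, unsteered groups almost always receive identical zero rewards, so their advantages vanish, while groups sampled across intensities usually mix rewarded and unrewarded rollouts. This mirrors the abstract's claim that varying the steering intensity upsamples rare behaviors and enriches rollout diversity.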