🤖 AI Summary
This work proposes a post-training framework for language models that aims to improve generalization to complex and novel tasks and to scale performance with test-time compute. The central idea is to build optimistic exploration into the training objective: the model is trained with set reinforcement learning (set RL) to generate sets of responses that are collectively accurate under the reward function and exploratory in their reasoning strategies. The framework has two parts: a general recipe for set RL under arbitrary objective functions, which adapts standard RL algorithms by modifying the advantage computation, and Polychromic Exploratory Policy Optimization (Poly-EPO), an algorithm that instantiates the recipe with an objective coupling exploration and exploitation. Across a range of reasoning benchmarks, Poly-EPO achieves higher pass@$k$ coverage, preserves greater diversity in model generations, and scales effectively with test-time compute.
📝 Abstract
Exploration is a cornerstone of learning from experience: it enables agents to find solutions to complex problems, generalize to novel ones, and scale performance with test-time compute. In this paper, we present a framework for post-training language models (LMs) that explicitly encourages optimistic exploration and promotes a synergy between exploration and exploitation. The central idea is to train the LM to generate sets of responses that are collectively accurate under the reward function and exploratory in their reasoning strategies. We first develop a general recipe for optimizing LMs with set reinforcement learning (set RL) under arbitrary objective functions, showing how standard RL algorithms can be adapted to this setting through a modification to the advantage computation. We then propose Polychromic Exploratory Policy Optimization (Poly-EPO), which instantiates this framework with an objective that explicitly synergizes exploration and exploitation. Across a range of reasoning benchmarks, we show that Poly-EPO improves generalization, as evidenced by higher pass@$k$ coverage, preserves greater diversity in model generations, and effectively scales with test-time compute.
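For readers unfamiliar with the pass@$k$ metric cited above, the following is a minimal sketch of the commonly used unbiased estimator: given `n` sampled responses to a problem, of which `c` are correct, it estimates the probability that at least one of `k` responses drawn without replacement is correct. The abstract does not state which estimator or sampling budget the authors use, so the function name `pass_at_k` and its parameters are illustrative assumptions rather than the paper's evaluation code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem (hypothetical helper, not from the paper).

    n: total number of responses sampled for the problem
    c: number of those responses judged correct
    k: number of attempts allowed at test time (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k incorrect responses: every size-k subset contains a correct one.
    if n - c < k:
        return 1.0
    # 1 minus the probability that a random size-k subset is entirely incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(counts: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, each given as an (n, c) pair."""
    return sum(pass_at_k(n, c, k) for n, c in counts) / len(counts)
```

Under this estimator, a model whose correct responses are spread across many problems scores higher at large `k` than one that repeatedly solves the same few problems, which is why pass@$k$ is often read as a measure of coverage.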