Poly-EPO: Training Exploratory Reasoning Models

📅 2026-04-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work proposes a post-training framework that aims to improve how language models generalize to complex and novel tasks and how well they scale with test-time compute. The key idea is to build optimistic exploration explicitly into the training objective, using set reinforcement learning (set RL) to optimize exploration and exploitation jointly. The approach rests on three components: set-level policy optimization, a modified advantage computation that adapts standard RL algorithms to sets of responses, and Polychromic Exploratory Policy Optimization (Poly-EPO), an algorithm that trains the model to generate sets of reasoning paths that are collectively accurate and diverse. Across multiple reasoning benchmarks, the method improves pass@$k$ coverage, preserves greater output diversity, and scales effectively with additional test-time compute.
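For readers unfamiliar with the pass@$k$ metric mentioned above, the following is a minimal sketch of the widely used unbiased estimator, $1 - \binom{n-c}{k} / \binom{n}{k}$, computed from $n$ sampled responses of which $c$ are correct. The function name `pass_at_k` and its arguments are illustrative; the summary does not state which estimator the authors use, so treat this as an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct.

    n: total responses sampled for a problem
    c: number of those responses judged correct
    k: sampling budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) counts the size-k subsets that contain no correct response.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # 2 correct responses out of 4 samples, evaluated at k = 2.
    print(pass_at_k(n=4, c=2, k=2))
```

Averaging this quantity over all problems in a benchmark gives the benchmark-level pass@$k$ score.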

📝 Abstract

Exploration is a cornerstone of learning from experience: it enables agents to find solutions to complex problems, generalize to novel ones, and scale performance with test-time compute. In this paper, we present a framework for post-training language models (LMs) that explicitly encourages optimistic exploration and promotes a synergy between exploration and exploitation. The central idea is to train the LM to generate sets of responses that are collectively accurate under the reward function and exploratory in their reasoning strategies. We first develop a general recipe for optimizing LMs with set reinforcement learning (set RL) under arbitrary objective functions, showing how standard RL algorithms can be adapted to this setting through a modification to the advantage computation. We then propose Polychromic Exploratory Policy Optimization (Poly-EPO), which instantiates this framework with an objective that explicitly synergizes exploration and exploitation. Across a range of reasoning benchmarks, we show that Poly-EPO improves generalization, as evidenced by higher pass@$k$ coverage, preserves greater diversity in model generations, and effectively scales with test-time compute.
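To make the idea of a set-level advantage more concrete, here is a rough sketch of how a score assigned to a whole set of responses could be turned into per-set advantages. Everything in it is an assumption made for illustration: the abstract does not give the Poly-EPO objective or the exact advantage formula, so `toy_set_score` (mean reward plus a weighted fraction of distinct strategy labels) and the mean-baseline in `set_advantages` are hypothetical stand-ins, not the authors' method.

```python
from statistics import mean


def toy_set_score(rewards, strategies, diversity_weight=0.5):
    """Hypothetical set objective (NOT the paper's): accuracy plus diversity.

    rewards: per-response rewards in the set, e.g. 1.0 if correct else 0.0
    strategies: one hashable label per response naming its reasoning strategy
    """
    accuracy = mean(rewards)
    diversity = len(set(strategies)) / len(strategies)
    return accuracy + diversity_weight * diversity


def set_advantages(set_scores):
    """Assumed baseline: subtract the mean score across sets for one prompt.

    Every response in a set would then share that set's advantage.
    """
    baseline = mean(set_scores)
    return [score - baseline for score in set_scores]


if __name__ == "__main__":
    # Two sets of two responses sampled for the same prompt.
    sets = [
        ([1.0, 0.0], ["algebra", "algebra"]),   # one correct, one strategy
        ([1.0, 1.0], ["algebra", "geometry"]),  # both correct, two strategies
    ]
    scores = [toy_set_score(r, s) for r, s in sets]
    print(set_advantages(scores))
```

In this toy setup the set that is both more accurate and more varied receives the positive advantage, which is the qualitative behaviour the abstract describes; the real objective may combine the two terms differently.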

Problem

Research questions and friction points this paper is trying to address.

exploratory reasoning

language models

reinforcement learning

generalization

test-time compute

Innovation

Methods, ideas, or system contributions that make the work stand out.

exploratory reasoning

set reinforcement learning

Poly-EPO