🤖 AI Summary
This work asks whether a strong pretrained foundation model is necessary for computationally efficient exploration in reinforcement learning with language models. It introduces a "sampling oracle" framework and formally defines "coverage" -- the extent to which the pre-trained model places probability mass on near-optimal responses, a form of implicit knowledge -- proving that coverage determines a fundamental lower bound on the runtime of any exploration algorithm in the framework. The analysis shows that inference-time intervention can outperform training-time policy modification, and that multi-turn interaction reduces sequence-level coverage requirements to token-level ones, substantially improving efficiency. Building on these insights, the authors design SpannerSampling: under sufficient coverage, it achieves optimal data efficiency with polynomial-time computation, guarantees they prove unattainable for training-time methods.
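As a toy illustration of the sampling-oracle access model described above (the class names, interface, and toy distribution here are hypothetical sketches, not the paper's actual formalism): the learner may interact with the pre-trained model only by drawing samples from it, and runtime is naturally measured in oracle calls. When the model "covers" a near-optimal response, few calls suffice to surface it.

```python
import random

class SamplingOracle:
    """Wraps a generative model; the learner may only call sample(),
    never inspect the model's parameters directly."""
    def __init__(self, model_dist, seed=0):
        # model_dist: hypothetical mapping prompt -> list of (response, prob)
        self.model_dist = model_dist
        self.rng = random.Random(seed)
        self.num_queries = 0  # computational cost is counted in oracle calls

    def sample(self, prompt):
        self.num_queries += 1
        responses, probs = zip(*self.model_dist[prompt])
        return self.rng.choices(responses, weights=probs, k=1)[0]

# A toy "covered" model: the near-optimal response "good" receives
# non-negligible mass, so sampling alone discovers it quickly.
oracle = SamplingOracle({"q": [("good", 0.2), ("bad", 0.8)]})
draws = {oracle.sample("q") for _ in range(200)}
```

With coverage (here, 20% mass on the good response), the set `draws` contains `"good"` after a modest number of queries; if that mass were exponentially small, any algorithm restricted to this interface would need correspondingly many calls, which is the intuition behind the coverage lower bound.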
📝 Abstract
Language model alignment (or, reinforcement learning) techniques that leverage active exploration -- deliberately encouraging the model to produce diverse, informative responses -- offer the promise of super-human capabilities. However, current understanding of algorithm design primitives for computationally efficient exploration with language models is limited. To better understand how to leverage access to powerful pre-trained generative models to improve the efficiency of exploration, we introduce a new computational framework for RL with language models, in which the learner interacts with the model through a sampling oracle. Focusing on the linear softmax model parameterization, we provide new results that reveal the computational-statistical tradeoffs of efficient exploration:

1. Necessity of coverage: Coverage refers to the extent to which the pre-trained model covers near-optimal responses -- a form of hidden knowledge. We show that coverage, while not necessary for data efficiency, lower bounds the runtime of any algorithm in our framework.
2. Inference-time exploration: We introduce a new algorithm, SpannerSampling, which obtains optimal data efficiency and is computationally efficient whenever the pre-trained model enjoys sufficient coverage, matching our lower bound. SpannerSampling leverages inference-time computation with the pre-trained model to reduce the effective search space for exploration.
3. Insufficiency of training-time interventions: We contrast the result above by showing that training-time interventions that produce proper policies cannot achieve similar guarantees in polynomial time.
4. Computational benefits of multi-turn exploration: Finally, we show that under additional representational assumptions, one can achieve improved runtime (replacing sequence-level coverage with token-level coverage) through multi-turn exploration.
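As a rough mathematical sketch of the objects named in the abstract (the notation and the exact coverage definition are illustrative assumptions, and may differ from the paper's): a linear softmax policy scores responses by a linear function of features, and coverage is commonly formalized as a density-ratio bound between an optimal policy and the pre-trained model.

```latex
% Linear softmax parameterization: responses y to prompt x are scored
% linearly in a feature map \phi(x, y) (feature notation assumed here).
\pi_{\theta}(y \mid x)
  = \frac{\exp\big(\langle \theta, \phi(x, y) \rangle\big)}
         {\sum_{y'} \exp\big(\langle \theta, \phi(x, y') \rangle\big)}

% One standard formalization of coverage: the pre-trained model
% \pi_{\mathrm{ref}} places non-negligible mass wherever the optimal
% policy \pi^{\star} does; smaller C_{\mathrm{cov}} means better coverage.
C_{\mathrm{cov}}
  = \sup_{x,\, y} \frac{\pi^{\star}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
```

Under this reading, the paper's lower bound says runtime must scale with a quantity like \(C_{\mathrm{cov}}\), while SpannerSampling matches it whenever \(C_{\mathrm{cov}}\) is bounded.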