🤖 AI Summary
This work addresses the challenge of enabling large language models to self-improve during inference without updating their parameters. The authors propose In-Context Policy Optimization (ICPO), a framework that refines model responses at inference time using self-assessed or externally observed reward signals. Theoretically, they establish for the first time that a single-layer linear self-attention model, under a suitable pretraining objective, can provably emulate policy optimization. Methodologically, they introduce Minimum-Entropy ICPO (ME-ICPO), a robust self-reflection mechanism that combines Fisher-weighted logit-matching pretraining, in-context policy optimization, and minimum-entropy selection with majority voting. On standard mathematical reasoning benchmarks, the approach attains competitive, top-tier performance among inference-time optimization methods while keeping computational overhead low.
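To make the ICPO loop concrete, the sketch below shows one plausible instantiation of in-context refinement with reward feedback. The names `generate` and `reward`, the prompt format, and the round count are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of an ICPO-style inference loop: the model refines
# its answer over several rounds by conditioning on its previous attempts
# and their rewards, with no parameter updates.

def icpo_loop(prompt, generate, reward, num_rounds=4):
    """generate: callable(str) -> str (one LLM call).
    reward: callable(prompt, response) -> float (self-assessed or external)."""
    history = []  # (response, reward) pairs accumulated in context
    for _ in range(num_rounds):
        # Augment the prompt with all prior attempts and their scores.
        context = prompt + "".join(
            f"\nAttempt: {resp}\nReward: {r}" for resp, r in history
        )
        response = generate(context)
        history.append((response, reward(prompt, response)))
    # Return the attempt with the highest observed reward.
    return max(history, key=lambda pair: pair[1])[0]
```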
📝 Abstract
We study test-time scaling, where a model improves its answer through multi-round self-reflection at inference. We introduce In-Context Policy Optimization (ICPO), in which an agent optimizes its response in context using self-assessed or externally observed rewards, without modifying its parameters. To explain this ICPO process, we theoretically show that with sufficient pretraining under a novel Fisher-weighted logit-matching objective, a single-layer linear self-attention model can provably imitate a policy-optimization algorithm for linear bandits. Building on this theory, we propose Minimum-Entropy ICPO (ME-ICPO), a practical algorithm that iteratively refines a response in context at inference time using previous responses and their self-assessed rewards. By selecting responses whose self-assessed rewards have minimum entropy, ME-ICPO ensures the robustness of these rewards via majority voting. Across standard mathematical reasoning tasks, ME-ICPO attains competitive, top-tier performance while keeping inference costs affordable compared with other inference-time algorithms. Overall, ICPO provides a principled understanding of self-reflection in LLMs and yields practical benefits for test-time scaling in mathematical reasoning.
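The minimum-entropy selection step described above can be sketched as follows, under the assumption that each candidate response receives several self-assessed reward samples. The function names and discrete reward values are illustrative, not the authors' API: the entropy of the reward votes measures how consistently the model scores a response, and the majority vote supplies the reward used in context.

```python
# Minimal sketch of minimum-entropy selection with majority voting,
# assuming each response has been scored multiple times by the model.

import math
from collections import Counter

def vote_entropy(reward_samples):
    """Shannon entropy of the empirical reward distribution; low entropy
    means the self-assessments agree with one another."""
    counts = Counter(reward_samples)
    total = len(reward_samples)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def select_min_entropy(candidates):
    """candidates: list of (response, reward_samples) pairs. Keep the
    response whose self-assessed rewards are most consistent, and return
    it with its majority-vote reward."""
    response, samples = min(candidates, key=lambda c: vote_entropy(c[1]))
    majority_reward = Counter(samples).most_common(1)[0][0]
    return response, majority_reward

# Example: three responses, each self-scored five times.
cands = [
    ("answer A", [1, 1, 0, 1, 1]),
    ("answer B", [1, 0, 0, 1, 0]),
    ("answer C", [1, 1, 1, 1, 1]),  # unanimous -> entropy 0, selected
]
print(select_min_entropy(cands))  # ('answer C', 1)
```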