🤖 AI Summary
This work addresses the challenge of enabling large language models to self-improve during inference without updating their parameters. The authors propose In-Context Policy Optimization (ICPO), a framework that refines model responses at inference time using self-assessed or externally observed reward signals. Theoretically, they establish for the first time that a single-layer linear self-attention model, under a suitable pretraining objective, can provably emulate policy optimization. Methodologically, they introduce Minimum-Entropy ICPO (ME-ICPO), a robust self-reflection mechanism that combines Fisher-weighted logit-matching pretraining, in-context policy optimization, and minimum-entropy selection with majority voting. On standard mathematical reasoning benchmarks, the approach attains competitive, top-tier performance among inference-time optimization methods while keeping computational overhead low.
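To make the ICPO loop concrete, the sketch below shows one plausible instantiation of in-context refinement with reward feedback. The names `generate` and `reward`, the prompt format, and the round count are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of an ICPO-style inference loop: the model refines
# its answer over several rounds by conditioning on its previous attempts
# and their rewards, with no parameter updates.

def icpo_loop(prompt, generate, reward, num_rounds=4):
    """generate: callable(str) -> str (one LLM call).
    reward: callable(prompt, response) -> float (self-assessed or external)."""
    history = []  # (response, reward) pairs accumulated in context
    for _ in range(num_rounds):
        # Augment the prompt with all prior attempts and their scores.
        context = prompt + "".join(
            f"\nAttempt: {resp}\nReward: {r}" for resp, r in history
        )
        response = generate(context)
        history.append((response, reward(prompt, response)))
    # Return the attempt with the highest observed reward.
    return max(history, key=lambda pair: pair[1])[0]
```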
📝 Abstract
We study test-time scaling, where a model improves its answer through multi-round self-reflection at inference. We introduce In-Context Policy Optimization (ICPO), in which an agent optimizes its response in context using self-assessed or externally observed rewards, without modifying its parameters. To explain this ICPO process, we theoretically show that with sufficient pretraining under a novel Fisher-weighted logit-matching objective, a single-layer linear self-attention model can provably imitate a policy-optimization algorithm for linear bandits. Building on this theory, we propose Minimum-Entropy ICPO (ME-ICPO), a practical algorithm that iteratively refines a response in context at inference time using previous responses and their self-assessed rewards. By selecting responses whose self-assessed rewards have minimum entropy, ME-ICPO ensures the robustness of these rewards via majority voting. Across standard mathematical reasoning tasks, ME-ICPO attains competitive, top-tier performance while keeping inference costs affordable compared with other inference-time algorithms. Overall, ICPO provides a principled understanding of self-reflection in LLMs and yields practical benefits for test-time scaling in mathematical reasoning.
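The minimum-entropy selection step described above can be sketched as follows, under the assumption that each candidate response receives several self-assessed reward samples. The function names and discrete reward values are illustrative, not the authors' API: the entropy of the reward votes measures how consistently the model scores a response, and the majority vote supplies the reward used in context.

```python
# Minimal sketch of minimum-entropy selection with majority voting,
# assuming each response has been scored multiple times by the model.

import math
from collections import Counter

def vote_entropy(reward_samples):
    """Shannon entropy of the empirical reward distribution; low entropy
    means the self-assessments agree with one another."""
    counts = Counter(reward_samples)
    total = len(reward_samples)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def select_min_entropy(candidates):
    """candidates: list of (response, reward_samples) pairs. Keep the
    response whose self-assessed rewards are most consistent, and return
    it with its majority-vote reward."""
    response, samples = min(candidates, key=lambda c: vote_entropy(c[1]))
    majority_reward = Counter(samples).most_common(1)[0][0]
    return response, majority_reward

# Example: three responses, each self-scored five times.
cands = [
    ("answer A", [1, 1, 0, 1, 1]),
    ("answer B", [1, 0, 0, 1, 0]),
    ("answer C", [1, 1, 1, 1, 1]),  # unanimous -> entropy 0, selected
]
print(select_min_entropy(cands))  # ('answer C', 1)
```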