IAPO: Information-Aware Policy Optimization for Token-Efficient Reasoning

📅 2026-02-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high computational cost of long chain-of-thought reasoning in large language models and the lack of fine-grained control over token-level reasoning resources. To this end, the authors propose IAPO, an information-aware post-training framework that is, to the authors' knowledge, the first to incorporate conditional mutual information into token-level reward shaping. By quantifying the conditional mutual information between each intermediate token and the final answer, IAPO constructs an advantage signal that guides the model to prioritize high-information steps while suppressing inefficient exploration. Evaluated across multiple reasoning benchmarks, IAPO reduces reasoning length by up to 36% while simultaneously improving accuracy, significantly outperforming existing token-efficient reinforcement learning approaches.

📝 Abstract
Large language models increasingly rely on long chains of thought to improve accuracy, yet such gains come with substantial inference-time costs. We revisit token-efficient post-training and argue that existing sequence-level reward-shaping methods offer limited control over how reasoning effort is allocated across tokens. To bridge this gap, we propose IAPO, an information-theoretic post-training framework that assigns token-wise advantages based on each token's conditional mutual information (CMI) with the final answer. This yields an explicit, principled mechanism for identifying informative reasoning steps and suppressing low-utility exploration. We provide a theoretical analysis showing that IAPO can induce monotonic reductions in reasoning verbosity without harming correctness. Empirically, IAPO consistently improves reasoning accuracy while reducing reasoning length by up to 36%, outperforming existing token-efficient RL methods across various reasoning datasets. These results suggest that information-aware advantage shaping is a powerful and general direction for token-efficient post-training. The code is available at https://github.com/YinhanHe123/IAPO.
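To make the core idea concrete, here is a minimal sketch of information-aware advantage shaping. It assumes a common proxy for each token's conditional mutual information with the final answer: the change in the answer's log-probability as that reasoning token is appended to the prefix. The function names (`info_gains`, `shape_advantages`), the mixing weight `lam`, and the additive shaping rule are illustrative assumptions, not the authors' actual implementation; consult the linked repository for the real method.

```python
import numpy as np

def info_gains(baseline_logp, answer_logps):
    """Per-token information-gain proxy for CMI with the final answer.

    baseline_logp:  log p(answer | question alone), a float.
    answer_logps[t]: log p(answer | question + first t+1 reasoning tokens).

    gains[t] = answer_logps[t] - answer_logps[t-1], i.e. how much appending
    token t changed the model's log-probability of the final answer.
    (This telescoping difference is a standard proxy, assumed here.)
    """
    logps = np.concatenate(([baseline_logp], np.asarray(answer_logps, float)))
    return np.diff(logps)

def shape_advantages(seq_advantage, baseline_logp, answer_logps, lam=0.5):
    """Token-wise advantages: sequence-level advantage plus a weighted
    information-gain term (additive shaping is an illustrative choice)."""
    return seq_advantage + lam * info_gains(baseline_logp, answer_logps)

# Toy example: three reasoning tokens; the second token is uninformative
# (the answer's log-probability drops), so its shaped advantage is lowest.
adv = shape_advantages(
    seq_advantage=1.0,
    baseline_logp=-5.0,
    answer_logps=[-4.0, -4.5, -2.0],
)
print(adv)  # [1.5  0.75 2.25]
```

In this toy run the low-information token receives a smaller advantage than its neighbors, which is the qualitative behavior the abstract describes: high-information steps are reinforced while low-utility exploration is suppressed.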
Problem

Research questions and friction points this paper addresses.

token-efficient reasoning
reasoning verbosity
reward shaping
inference cost
reasoning steps
Innovation

Methods, ideas, or system contributions that make the work stand out.

Information-Aware Policy Optimization
Token-Efficient Reasoning
Conditional Mutual Information
Advantage Shaping
Post-Training