Long Live The Balance: Information Bottleneck Driven Tree-based Policy Optimization

📅 2026-05-27
🤖 AI Summary
This work addresses the instability and suboptimal performance commonly observed in online reinforcement learning with large language models, which often stem from an imbalanced exploration-exploitation trade-off. The authors propose IB-Score, a metric grounded in information bottleneck theory, and the IB-TPO framework, which uses IB-Score as a fine-grained optimization objective within tree-based policy optimization. An IB-guided tree sampling strategy yields 50% more trajectories under the same token budget and reuses the tree structure for Monte Carlo estimation of IB-Score. In experiments on standard benchmarks, IB-TPO outperforms the GRPO baseline by 2.9% to 3.6% and also outperforms other state-of-the-art online reinforcement learning approaches.
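To make the token-budget claim concrete, the following minimal sketch shows why sharing prefixes in a tree can produce more complete trajectories than independent rollouts for the same number of generated tokens. Everything here is an assumption for illustration: the function names `tree_cost` and `independent_count`, the fixed segment length, and the branching factors are hypothetical and are not taken from the paper, and the resulting gain is not meant to reproduce the reported 50% figure.

```python
def tree_cost(branching, segment_tokens):
    """Return (tokens_generated, leaf_trajectories) for a sampling tree.

    branching[k] is the number of children each node spawns at level k;
    every edge of the tree is one generated segment of `segment_tokens` tokens.
    (Hypothetical helper, not the paper's sampler.)
    """
    nodes = 1   # start from the single prompt node
    total = 0   # tokens generated so far
    for b in branching:
        nodes *= b                      # nodes at this level
        total += nodes * segment_tokens  # each node costs one new segment
    return total, nodes


def independent_count(budget, depth, segment_tokens):
    """Number of full-length independent rollouts that fit in `budget` tokens."""
    return budget // (depth * segment_tokens)


# Illustrative numbers only: three levels, binary branching, 100-token segments.
branching = [2, 2, 2]
segment_tokens = 100

tokens, leaves = tree_cost(branching, segment_tokens)
flat = independent_count(tokens, len(branching), segment_tokens)
print(tokens, leaves, flat)  # 1400 8 4
```

In this toy setting the tree spends 1400 tokens to reach 8 leaf trajectories, while independent sampling with the same budget completes only 4, because every shared prefix is generated once rather than once per trajectory.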
📝 Abstract
Recent advances in online reinforcement learning (RL) for large language models (LLMs) have demonstrated promising performance in complex reasoning tasks. However, they often exhibit an imbalanced exploration-exploitation trade-off, resulting in unstable optimization and sub-optimal performance. We introduce IB-Score, a novel metric grounded in Information Bottleneck theory that evaluates a policy's exploration-exploitation balance by quantifying the trade-off between step-level reasoning diversity and mutual information shared with the correct answer. Analysis based on IB-Score shows that popular online RL approaches (e.g., GRPO) with common regularizers fail to maintain this balance consistently during training, leading to suboptimal results. To address this, we propose Information Bottleneck-driven Tree-based Policy Optimization (IB-TPO), a principled framework that formulates IB-Score as a fine-grained optimization objective and utilizes a novel IB-guided tree sampling strategy that not only improves the efficiency of online sampling, yielding 50% more trajectories under the same token budget, but also reuses the tree structure for effective Monte Carlo estimation of IB-Score. Extensive experiments across standard benchmarks show that our method significantly outperforms the GRPO baseline by 2.9% to 3.6% and also outperforms other state-of-the-art online RL approaches. Our code is available at https://github.com/alibaba/EfficientRL.
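For orientation, the classical Information Bottleneck objective that the abstract's terminology refers to can be written as below. This is the standard textbook form, not the paper's IB-Score: the abstract does not give the exact formula, so the symbols $X$ (input), $T$ (intermediate representation), $Y$ (target), and $\beta$ (trade-off weight) are generic, and how the paper maps step-level reasoning diversity and the correct answer onto these terms is left unspecified here.

```latex
% Classical Information Bottleneck Lagrangian (generic form, not the paper's IB-Score):
% compress the input X into T while keeping T informative about the target Y.
\min_{p(t \mid x)} \; I(X; T) \;-\; \beta \, I(T; Y), \qquad \beta > 0
```

A larger $\beta$ puts more weight on preserving information about the target, while a smaller $\beta$ favors compression; an IB-based balance metric trades off two such information terms in a similar spirit.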
Problem

Research questions and friction points this paper is trying to address.

exploration-exploitation trade-off
online reinforcement learning
large language models
policy optimization
information bottleneck
Innovation

Methods, ideas, or system contributions that make the work stand out.

Information Bottleneck
Tree-based Policy Optimization
Exploration-Exploitation Balance
Online Reinforcement Learning
Large Language Models
Authors

Hao Jiang
Alibaba Group
LLM & AIGC
Shurui Li
Alibaba Cloud Computing, Alibaba Group
Tianpeng Bu
Alibaba Cloud Computing, Alibaba Group
Bowen Xu
Alibaba Cloud Computing, Alibaba Group
Xin Liu
Alibaba Cloud Computing, Alibaba Group
Qihua Chen
Alibaba Cloud Computing, Alibaba Group
Hongtao Duan
Nanjing Institute of Geography and Limnology, Chinese Academy of Sciences
Ocean color remote sensing, Lake remote sensing
Lulu Hu
Alibaba Cloud Computing, Alibaba Group
Bin Yang
Alibaba Cloud Computing, Alibaba Group
Minying Zhang
Alibaba Cloud Computing, Alibaba Group