TRE: Encouraging Exploration in the Trust Region

📅 2026-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the issue that standard entropy regularization in reinforcement learning for large language models often leads to ineffective exploration or even performance degradation due to the accumulation of tail risks. To mitigate this, the authors propose Trust Region Entropy (TRE), a novel approach that, for the first time, integrates trust region constraints into entropy regularization. By maximizing entropy within a local neighborhood of the current policy, TRE focuses exploration on high-confidence candidate tokens, thereby enhancing both exploration efficiency and reasoning coherence. Implemented within the PPO framework, the method combines localized entropy maximization with trust region constraints and demonstrates significant improvements over standard PPO, conventional entropy regularization, and other exploration baselines across benchmarks including MATH, Countdown, and HH.

📝 Abstract
Entropy regularization is a standard technique in reinforcement learning (RL) to enhance exploration, yet it yields negligible effects or even degrades performance in Large Language Models (LLMs). We attribute this failure to the cumulative tail risk inherent to LLMs with massive vocabularies and long generation horizons. In such environments, standard global entropy maximization indiscriminately dilutes probability mass into the vast tail of invalid tokens rather than focusing on plausible candidates, thereby disrupting coherent reasoning. To address this, we propose Trust Region Entropy (TRE), a method that encourages exploration strictly within the model's trust region. Extensive experiments across mathematical reasoning (MATH), combinatorial search (Countdown), and preference alignment (HH) tasks demonstrate that TRE consistently outperforms vanilla PPO, standard entropy regularization, and other exploration baselines. Our code is available at https://github.com/WhyChaos/TRE-Encouraging-Exploration-in-the-Trust-Region.
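The abstract describes TRE as maximizing entropy only within the model's trust region rather than over the full vocabulary, so the bonus rewards spreading probability among plausible candidates instead of the invalid tail. The paper's exact construction is not reproduced on this page; the sketch below is a minimal illustration that approximates the trust region with nucleus (top-p) truncation. The function name, the `top_p` parameter, and the truncation rule are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def trust_region_entropy(logits, top_p=0.9):
    """Entropy of a next-token policy restricted to its nucleus (top-p) set.

    Illustrative sketch only: the trust region is approximated by the
    smallest set of tokens covering `top_p` probability mass, so the vast
    low-probability tail contributes nothing to the entropy bonus.
    """
    # Softmax over the full vocabulary (numerically stabilized).
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Smallest high-confidence set whose cumulative mass reaches top_p.
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    k = int(np.searchsorted(cum, top_p)) + 1
    # Renormalize inside the region and compute its Shannon entropy.
    nucleus = probs[order[:k]]
    nucleus = nucleus / nucleus.sum()
    return float(-(nucleus * np.log(nucleus)).sum())
```

In a PPO-style objective this quantity would replace the usual full-vocabulary entropy term: a distribution that is nearly uniform over its few high-confidence candidates gets the maximal bonus, while pushing mass into the tail does not.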
Problem

Research questions and friction points this paper is trying to address.

entropy regularization
large language models
exploration
trust region
reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Trust Region Entropy
Exploration
Large Language Models
Entropy Regularization
Reinforcement Learning
Chao Huang
Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences
Yujing Lu
Baidu Inc.
Quangang Li
Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences
Shenghe Wang
Baidu Inc.
Yan Wang
Baidu Inc.
Yueyang Zhang
Baidu Inc.
Long Xia
Research Scientist, Baidu
information retrieval, data mining, applied machine learning, recommender system
Jiashu Zhao
Wilfrid Laurier University
Zhiyuan Sun
Baidu Inc.
Daiting Shi
Baidu Inc.
Tingwen Liu
Institute of Information Engineering, Chinese Academy of Sciences
Content Security, Natural Language Processing, Knowledge Graph