Taming the Thinker: Conditional Entropy Shaping for Adaptive LLM Reasoning

📅 2026-05-19
🤖 AI Summary
Existing entropy-based reasoning methods struggle to balance response length and accuracy. This work proposes Conditional Entropy Shaping (CES), a framework built on DAPO reinforcement learning that controls token-level entropy during generation, so that the model answers simple problems concisely and explores more deeply on hard ones. The key idea is a conditional bidirectional policy: high-entropy "forking point" tokens are penalized on correct reasoning paths to improve conciseness and rewarded on incorrect paths to encourage exploration and error correction. On DeepSeek-R1-Distill-7B, CES improves average accuracy while reducing response length relative to DAPO across 12 mathematical benchmarks; supplementary experiments show similar trends on a smaller 1.5B backbone and on out-of-domain benchmarks.
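For reference, the token-level entropy that the summary refers to is usually taken to be the Shannon entropy of the model's next-token distribution, conditioned on the prompt and the tokens generated so far. The notation below is an assumption made for illustration (prompt x, response tokens y, policy π_θ, vocabulary V); the paper may use different symbols or a different normalization.

```latex
% Assumed notation: x = prompt, y_{<t} = tokens generated before step t,
% \pi_\theta = the LLM policy, \mathcal{V} = the vocabulary.
H_t \;=\; -\sum_{v \in \mathcal{V}} \pi_\theta\!\left(v \mid x, y_{<t}\right)\,
          \log \pi_\theta\!\left(v \mid x, y_{<t}\right)
```

A token with large H_t is one where the model spreads probability over many continuations, which is the sense in which such positions act as "forking points".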
📝 Abstract
Entropy-based deep reasoning has emerged as a promising direction for improving the reasoning capabilities of Large Language Models (LLMs), but existing methods often either increase response length indiscriminately or shorten responses at the cost of accuracy. To better balance this trade-off, we introduce Conditional Entropy Shaping (CES), a framework that dynamically controls token-level response entropy, enabling LLMs to produce concise solutions on simple problems while encouraging deeper exploration on hard ones. Built on DAPO, CES uses token-level entropy as an uncertainty signal and applies a conditional bidirectional policy: it penalizes high-entropy "forking point" tokens on correct reasoning paths to improve conciseness, and rewards them on incorrect paths to encourage exploration and error correction. We implement CES on DeepSeek-R1-Distill-7B and evaluate it on 12 mathematical benchmarks. CES consistently improves average accuracy while reducing response length relative to DAPO, and supplementary experiments show similar trends on a smaller 1.5B backbone and on out-of-domain benchmarks.
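The conditional bidirectional policy described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say whether CES modifies rewards, advantages, or the loss, so the sketch assumes a per-token adjustment added to a sequence-level advantage. The function names, the `entropy_threshold` used to decide what counts as a forking point, and the `coeff` scale are all hypothetical.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def shaped_advantages(token_probs, base_advantage, is_correct,
                      entropy_threshold=1.0, coeff=0.1):
    """Hypothetical per-token advantage adjustment (not from the paper).

    token_probs: list of next-token distributions, one per generated token.
    base_advantage: sequence-level advantage from the RL algorithm.
    is_correct: whether the response's final answer was judged correct.
    entropy_threshold, coeff: illustrative hyperparameters (assumptions).
    """
    # Correct path: penalize high-entropy tokens. Incorrect path: reward them.
    sign = -1.0 if is_correct else 1.0
    shaped = []
    for probs in token_probs:
        h = token_entropy(probs)
        # Only tokens above the threshold are treated as forking points.
        bonus = sign * coeff * h if h > entropy_threshold else 0.0
        shaped.append(base_advantage + bonus)
    return shaped

# Usage: one confident token and one high-entropy token.
confident = [0.97, 0.01, 0.01, 0.01]
forking = [0.25, 0.25, 0.25, 0.25]
print(shaped_advantages([confident, forking], 1.0, is_correct=True))
print(shaped_advantages([confident, forking], -1.0, is_correct=False))
```

In this sketch the confident token keeps its base advantage in both cases, while the high-entropy token is pushed down on the correct response and up on the incorrect one, which mirrors the two directions of the policy described above.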
Problem

Research questions and friction points this paper is trying to address.

entropy-based reasoning
response length
accuracy trade-off
Large Language Models
reasoning capability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Conditional Entropy Shaping
adaptive reasoning
token-level entropy
large language models
DAPO
Shuyu Wei
Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence, Beijing Jiaotong University
Jian Sun
Unisound AI Technology Co., Ltd.
Delai Qiu
Unisound AI Technology Co., Ltd.
Yining Wang
NLP Researcher, Unisound
Natural Language Processing, Machine Translation
Shengping Liu
Unisound AI Technology Co., Ltd.
Jiaen Liang
Unisound AI Technology Co., Ltd.
Ying Fu
Unisound AI Technology Co., Ltd.
Wei Huang
Unisound AI Technology Co., Ltd.
Jitao Sang
Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence, Beijing Jiaotong University