Count Counts: Motivating Exploration in LLM Reasoning with Count-based Intrinsic Rewards

📅 2025-10-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) suffer from repetitive reasoning patterns and entrapment in local optima during multi-step reasoning, primarily due to sparse extrinsic rewards and insufficient exploration. Method: This paper proposes an exploration mechanism based on intrinsic motivation. Its core innovation is a lightweight “Coin Flipping Network” that dynamically generates intrinsic rewards by jointly estimating pseudo-counts and epistemic uncertainty, thereby balancing novelty-driven exploration with task-oriented learning. Integrated into a reinforcement learning framework, the method explicitly tracks how thoroughly reasoning trajectories have been explored and plugs into policy-optimization algorithms such as GRPO. Contribution/Results: Experiments demonstrate substantial improvements on complex reasoning benchmarks: the approach yields higher-quality, more diverse chain-of-thought (CoT) generations, consistently escapes suboptimal solutions, and enhances both robustness and generalization in LLM-based reasoning, offering a principled pathway toward more reliable and adaptive reasoning systems.

📝 Abstract
Reinforcement Learning (RL) has become a compelling way to strengthen the multi-step reasoning ability of Large Language Models (LLMs). However, prevalent RL paradigms still lean on sparse outcome-based rewards and limited exploration, which often drive LLMs toward repetitive and suboptimal reasoning patterns. In this paper, we study the central question of how to design exploration for LLM reasoning and introduce MERCI (Motivating Exploration in LLM Reasoning with Count-based Intrinsic Rewards), a novel RL algorithm that augments policy optimization with a principled intrinsic reward. Building on the idea of count-based exploration, MERCI leverages a lightweight Coin Flipping Network (CFN) to estimate pseudo-counts and, in turn, epistemic uncertainty over reasoning trajectories, and converts them into an intrinsic reward that values novelty while preserving the learning signal from task rewards. We integrate MERCI into advanced RL frameworks such as Group Relative Policy Optimization (GRPO). Experiments on complex reasoning benchmarks demonstrate that MERCI encourages richer and more varied chains of thought, significantly improves performance over strong baselines, and helps the policy escape local routines to discover better solutions. These results indicate that targeted intrinsic motivation can make exploration reliable for language model reasoning.
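To make the core mechanism concrete, below is a minimal sketch of the Coin Flipping Network idea from the count-based exploration literature (Lobel et al., 2023), not the authors' implementation: each visit to a state is assigned a fixed random ±1 label vector, and a regressor trained on these labels converges to the running mean of past coins, whose magnitude approximates 1/√n for a state visited n times. The input here is assumed to be some embedding of a reasoning trajectory; `num_flips`, `hidden_dim`, and the network shape are illustrative hyperparameters, not values from the paper.

```python
import torch
import torch.nn as nn

class CoinFlippingNetwork(nn.Module):
    """Predicts num_flips Rademacher 'coins' per state; prediction magnitude ~ 1/sqrt(n)."""
    def __init__(self, input_dim: int, hidden_dim: int = 256, num_flips: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_flips),
        )

    def forward(self, states: torch.Tensor) -> torch.Tensor:
        return self.net(states)

    @torch.no_grad()
    def intrinsic_reward(self, states: torch.Tensor) -> torch.Tensor:
        # mean(f(s)^2) concentrates around 1/n for a state visited n times,
        # so its square root behaves like the classic count bonus 1/sqrt(n).
        return self.forward(states).pow(2).mean(dim=-1).sqrt()

def make_coin_labels(batch_size: int, num_flips: int) -> torch.Tensor:
    # One fixed +/-1 label vector is drawn per visit and stored with the state;
    # regressing onto these makes the optimal prediction the mean of past coins.
    return torch.randint(0, 2, (batch_size, num_flips)).float() * 2.0 - 1.0

def cfn_loss(model: CoinFlippingNetwork, states: torch.Tensor, coins: torch.Tensor) -> torch.Tensor:
    # MSE regression onto the stored coin labels from the replay buffer.
    return nn.functional.mse_loss(model(states), coins)
```

Rarely visited trajectories keep predictions close to their few stored coins, yielding a large bonus; frequently visited ones average toward zero, so the bonus decays, which is exactly the novelty-seeking behavior the abstract describes.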
Problem

Research questions and friction points this paper is trying to address.

Enhancing multi-step reasoning in LLMs with intrinsic rewards
Overcoming repetitive patterns in language model reasoning
Improving exploration for better solution discovery in RL
Innovation

Methods, ideas, or system contributions that make the work stand out.

Count-based intrinsic rewards motivate LLM exploration
Lightweight Coin Flipping Network estimates epistemic uncertainty
Integrated with policy optimization for diverse reasoning chains (see the sketch after this list)
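A minimal sketch of how such an intrinsic bonus could be folded into GRPO-style group-relative advantages. The additive shaping, the `beta` trade-off coefficient, and the mean/std normalization follow common GRPO practice; the paper's exact combination rule may differ, and `grpo_advantages` is a hypothetical helper name.

```python
import torch

def grpo_advantages(task_rewards: torch.Tensor,
                    intrinsic_rewards: torch.Tensor,
                    beta: float = 0.1,
                    eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages for one prompt's G rollouts (each input of shape (G,)).

    The intrinsic novelty bonus is added to the sparse outcome reward before
    the usual group normalization; beta trades novelty against the task signal.
    """
    shaped = task_rewards + beta * intrinsic_rewards
    return (shaped - shaped.mean()) / (shaped.std() + eps)

# Example: 4 rollouts, one correct; novelty bonuses break ties among the failures,
# so the policy gradient still prefers novel (less-visited) reasoning paths.
adv = grpo_advantages(torch.tensor([1.0, 0.0, 0.0, 0.0]),
                      torch.tensor([0.2, 0.8, 0.1, 0.4]))
```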
👥 Authors
Xuan Zhang
Fudan University, Shanghai Innovation Institute, Tencent Youtu Lab
Ruixiao Li
Fudan University, Shanghai Innovation Institute
Zhijian Zhou
Fudan University, Shanghai Innovation Institute, Tencent Youtu Lab
Long Li
Research Staff Member, Inspur Group Co., Ltd.
Software Defined Networking · Network Performance Optimization
Yulei Qin
Tencent Youtu Lab
Language Models · Computer Vision · Medical Image Analysis
Ke Li
Tencent Youtu Lab
Xing Sun
Tencent Youtu Lab
LLM · MLLM · Agent
Xiaoyu Tan
Tencent Youtu Lab
Chao Qu
Fudan University
Yuan Qi
Fudan University, Shanghai Innovation Institute