Navigate the Unknown: Enhancing LLM Reasoning with Intrinsic Motivation Guided Exploration

📅 2025-05-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Reinforcement learning (RL) for enhancing large language model (LLM) reasoning faces a critical bottleneck: sparse terminal rewards provide insufficient supervision for intermediate reasoning steps, hindering exploration and exacerbating path deviation, thereby impeding multi-step iterative optimization. To address this, we propose an intrinsic-motivation-driven dense reward exploration framework. Our method introduces three novel components: trajectory-aware exploration rewards, dynamic reward scaling, and advantage-preserving reward shaping—jointly enabling broad exploration, training stability, and policy consistency. By integrating token-level exploration incentives with advantage-function constraints, our approach achieves significant performance gains across three public benchmarks—including Countdown-4—with up to a 22.39% accuracy improvement on the most challenging tasks. This work establishes a scalable, robust paradigm for RL-based reasoning optimization in LLMs.

📝 Abstract
Reinforcement learning (RL) has emerged as a pivotal method for improving the reasoning capabilities of Large Language Models (LLMs). However, prevalent RL approaches such as Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO) face critical limitations due to their reliance on sparse outcome-based rewards and inadequate mechanisms for incentivizing exploration. These limitations result in inefficient guidance for multi-step reasoning processes. Specifically, sparse reward signals fail to deliver effective or sufficient feedback, particularly for challenging problems. Furthermore, such reward structures induce systematic biases that prioritize exploitation of familiar trajectories over novel solution discovery. These shortcomings critically hinder performance in complex reasoning tasks, which inherently demand iterative refinement across intermediate steps. To address these challenges, we propose an Intrinsic Motivation guidEd exploratioN meThOd foR LLM Reasoning (i-MENTOR), a novel method designed to both deliver dense rewards and amplify exploration in the RL-based training paradigm. i-MENTOR introduces three key innovations: trajectory-aware exploration rewards that mitigate bias in token-level strategies while maintaining computational efficiency; dynamic reward scaling to stabilize exploration and exploitation in large action spaces; and advantage-preserving reward implementation that maintains advantage distribution integrity while incorporating exploratory guidance. Experiments across three public datasets demonstrate i-MENTOR's effectiveness, with a 22.39% improvement on the difficult dataset Countdown-4.
Problem

Research questions and friction points this paper is trying to address.

Improving LLM reasoning with intrinsic motivation to overcome sparse reward limitations
Addressing exploration bias in RL methods for multi-step reasoning tasks
Enhancing reward mechanisms to stabilize and guide complex reasoning processes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Trajectory-aware exploration rewards mitigate token-level bias
Dynamic reward scaling stabilizes exploration and exploitation
Advantage-preserving reward maintains distribution integrity
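The three components above can be illustrated together in a minimal sketch. This is not the paper's actual algorithm: the novelty measure (inverse square-root of trajectory frequency within a sampled group), the decay schedule, and all function and parameter names (`shaped_rewards`, `scale0`, `decay`) are hypothetical stand-ins chosen to show how a trajectory-level bonus could be scaled dynamically and centered so the group-mean baseline, and hence a GRPO-style advantage estimate, is left unchanged.

```python
import math
from collections import Counter

def shaped_rewards(trajectories, terminal_rewards, step, scale0=0.1, decay=1e-3):
    """Illustrative sketch: dense exploration bonuses added to sparse terminal rewards.

    trajectories: list of token-id tuples sampled for the same prompt (a group)
    terminal_rewards: list of sparse outcome rewards, one per trajectory
    step: current training step, used to decay the exploration scale
    """
    # Trajectory-aware novelty: rarer whole trajectories earn larger bonuses,
    # avoiding the bias of rewarding individual rare tokens
    counts = Counter(trajectories)
    novelty = [1.0 / math.sqrt(counts[t]) for t in trajectories]

    # Dynamic reward scaling: shrink the exploration bonus as training progresses
    scale = scale0 / (1.0 + decay * step)
    bonus = [scale * n for n in novelty]

    # Advantage-preserving implementation: center the bonus within the group
    # so the group mean (the baseline in a group-relative advantage) is unchanged
    mean_bonus = sum(bonus) / len(bonus)
    centered = [b - mean_bonus for b in bonus]

    return [r + c for r, c in zip(terminal_rewards, centered)]
```

Because the bonuses are mean-centered per group, the shaped rewards reorder trajectories toward novel ones without shifting the group's average reward.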