AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning

📅 2026-05-01

🤖 AI Summary
This work addresses the credit assignment problem that sparse, outcome-only rewards create in multi-turn agentic reinforcement learning by proposing AEM, a supervision-free credit assignment method. Rather than relying on dense intermediate supervision such as process reward models, AEM adaptively modulates entropy dynamics during training to balance exploration and exploitation. Its key theoretical step is to lift entropy analysis from the token level to the response level, which reduces token sampling variance, and to show that entropy drift under natural gradients is governed by the product of the advantage and the relative response surprisal. Experiments across multiple benchmarks and language models from 1.5B to 32B parameters show consistent gains, including a 1.4 percent gain on SWE-bench-Verified when AEM is integrated into a state-of-the-art baseline.
📝 Abstract
Reinforcement learning (RL) has significantly advanced the ability of large language model (LLM) agents to interact with environments and solve multi-turn tasks. Yet effective training remains challenging, as sparse, outcome-only rewards make it difficult to assign credit to individual steps in an agent's action trajectory. A common remedy is to introduce dense intermediate supervision, such as process reward models or auxiliary self-supervised signals, but this increases supervision and tuning complexity and often generalizes poorly across tasks and domains. This paper presents AEM, a supervision-free credit assignment method that adaptively modulates entropy dynamics during RL training to achieve a more effective exploration-exploitation trade-off. Theoretically, we elevate entropy analysis from the token level to the response level to reduce token sampling variance and show that entropy drift under natural gradients is intrinsically governed by the product of the advantage and the relative response surprisal. Specifically, we derive a practical proxy to reshape training dynamics, enabling a natural transition from exploration to exploitation. Extensive experiments across various benchmarks and models ranging from 1.5B to 32B parameters demonstrate the effectiveness of AEM, including a notable 1.4 percent gain when integrated into a state-of-the-art baseline on the highly challenging SWE-bench-Verified benchmark.
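The entropy-drift statement in the abstract can be illustrated on a toy problem. The sketch below is not the paper's implementation or its practical proxy. It assumes a tabular softmax policy over a small, fixed set of whole responses, for which a natural-gradient step adds step_size times the advantage to each response logit (a standard result for that parameterization), and it takes "relative response surprisal" to mean -log pi(y) minus the policy entropy H. Under those assumptions the first-order entropy change is step_size * E_pi[A(y) * (-log pi(y) - H)]. All names here (`predicted_entropy_drift`, `compare_drift`, `LOGITS`, the reward vectors) are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of response logits."""
    shifted = logits - logits.max()
    weights = np.exp(shifted)
    return weights / weights.sum()


def entropy(probs):
    """Shannon entropy (in nats) of a distribution over responses."""
    return float(-(probs * np.log(probs)).sum())


def predicted_entropy_drift(probs, advantage, step_size):
    """First-order entropy change under the assumed natural-gradient step.

    Assumption: 'relative surprisal' = -log pi(y) - H(pi).
    """
    surprisal = -np.log(probs)
    relative_surprisal = surprisal - entropy(probs)
    return float(step_size * (probs * advantage * relative_surprisal).sum())


def compare_drift(logits, rewards, step_size=1e-3):
    """Return (predicted, actual) entropy change for one toy update."""
    probs = softmax(logits)
    # Baseline-centered advantage: reward minus expected reward under pi.
    advantage = rewards - (probs * rewards).sum()
    # Assumed natural-gradient step for a tabular softmax policy.
    new_probs = softmax(logits + step_size * advantage)
    actual = entropy(new_probs) - entropy(probs)
    predicted = predicted_entropy_drift(probs, advantage, step_size)
    return predicted, actual


# Four candidate responses, from most to least likely (illustrative values).
LOGITS = np.array([2.0, 1.0, 0.0, -1.0])

if __name__ == "__main__":
    # Case 1: only the rare (high-surprisal) responses are rewarded.
    print(compare_drift(LOGITS, np.array([0.0, 0.0, 1.0, 1.0])))
    # Case 2: only the already-likely (low-surprisal) responses are rewarded.
    print(compare_drift(LOGITS, np.array([1.0, 1.0, 0.0, 0.0])))
```

In this toy setting, rewarding the rare responses raises entropy while rewarding the already-likely responses lowers it, which is the sign behavior the product of advantage and relative surprisal implies.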
Problem

Research questions and friction points this paper is trying to address.

Credit Assignment
Reinforcement Learning
Sparse Rewards
Multi-Turn Tasks
Large Language Model Agents
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Entropy Modulation
Credit Assignment
Reinforcement Learning
Exploration-Exploitation Trade-off
Response-Level Entropy