HiPER: Hierarchical Reinforcement Learning with Explicit Credit Assignment for Large Language Model Agents

📅 2026-02-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses credit assignment and training instability in large language model agents on long-horizon tasks with sparse rewards, limitations the authors trace to flat policy architectures. They propose HiPER, a hierarchical reinforcement learning framework that explicitly decomposes the policy into a high-level planner and a low-level executor. Central to HiPER is Hierarchical Advantage Estimation (HAE), which enables unbiased, low-variance credit assignment at both levels. Combined with a subgoal mechanism and fine-tuned from Qwen2.5-7B-Instruct, HiPER achieves state-of-the-art success rates of 97.4% on ALFWorld and 83.3% on WebShop, improving on the prior best methods by +6.6% and +8.3%, respectively. Gains are largest in complex, long-horizon scenarios involving multiple interdependent subtasks.

📝 Abstract
Training LLMs as interactive agents for multi-turn decision-making remains challenging, particularly in long-horizon tasks with sparse and delayed rewards, where agents must execute extended sequences of actions before receiving meaningful feedback. Most existing reinforcement learning (RL) approaches model LLM agents as flat policies operating at a single time scale, selecting one action at each turn. In sparse-reward settings, such flat policies must propagate credit across the entire trajectory without explicit temporal abstraction, which often leads to unstable optimization and inefficient credit assignment. We propose HiPER, a novel Hierarchical Plan-Execute RL framework that explicitly separates high-level planning from low-level execution. HiPER factorizes the policy into a high-level planner that proposes subgoals and a low-level executor that carries them out over multiple action steps. To align optimization with this structure, we introduce a key technique called hierarchical advantage estimation (HAE), which carefully assigns credit at both the planning and execution levels. By aggregating returns over the execution of each subgoal and coordinating updates across the two levels, HAE provides an unbiased gradient estimator and provably reduces variance compared to flat generalized advantage estimation. Empirically, HiPER achieves state-of-the-art performance on challenging interactive benchmarks, reaching 97.4% success on ALFWorld and 83.3% on WebShop with Qwen2.5-7B-Instruct (+6.6% and +8.3% over the best prior method), with especially large gains on long-horizon tasks requiring multiple dependent subtasks. These results highlight the importance of explicit hierarchical decomposition for scalable RL training of multi-turn LLM agents.
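The abstract describes HAE as aggregating returns over each subgoal's execution and assigning credit separately at the planning and execution levels. The paper's exact formulas are not reproduced on this page, so the following is only a minimal sketch of that idea: it assumes subgoal returns are discounted sums of step rewards, that each level runs standard generalized advantage estimation (GAE) over its own time scale, and that low-level credit is confined to each subgoal's segment. All function names and the `(start, end)` subgoal representation are illustrative, not HiPER's API.

```python
# Illustrative two-level advantage estimation; names and the (start, end)
# subgoal representation are assumptions, not taken from the paper.

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Standard generalized advantage estimation over one reward/value
    sequence; the step after the last reward is treated as terminal."""
    adv = [0.0] * len(rewards)
    acc = 0.0
    for t in reversed(range(len(rewards))):
        next_v = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_v - values[t]
        acc = delta + gamma * lam * acc
        adv[t] = acc
    return adv

def hierarchical_advantages(step_rewards, subgoal_bounds,
                            high_values, low_values,
                            gamma=0.99, lam=0.95):
    """High level: GAE over per-subgoal aggregated (discounted) returns.
    Low level: GAE restricted to each subgoal's own step segment.
    subgoal_bounds: list of (start, end) half-open step-index ranges."""
    # Planner level: collapse each subgoal's steps into one aggregated reward,
    # so the planner's effective horizon is the number of subgoals, not steps.
    subgoal_rewards = [
        sum(gamma ** k * step_rewards[s + k] for k in range(e - s))
        for s, e in subgoal_bounds
    ]
    high_adv = gae(subgoal_rewards, high_values, gamma, lam)

    # Executor level: credit never leaks across subgoal boundaries.
    low_adv = [0.0] * len(step_rewards)
    for s, e in subgoal_bounds:
        low_adv[s:e] = gae(step_rewards[s:e], low_values[s:e], gamma, lam)
    return high_adv, low_adv
```

With a single sparse terminal reward, the planner-level sequence is much shorter than the raw step sequence, which is the intuition behind the abstract's claim that aggregating returns per subgoal reduces variance relative to flat GAE over the full trajectory.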
Problem

Research questions and friction points this paper is trying to address.

hierarchical reinforcement learning
credit assignment
large language model agents
sparse rewards
long-horizon tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Reinforcement Learning
Credit Assignment
Large Language Model Agents
Hierarchical Advantage Estimation
Plan-Execute Framework