Divide and Conquer: Grounding LLMs as Efficient Decision-Making Agents via Offline Hierarchical Reinforcement Learning

📅 2025-05-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Large language models (LLMs) explore poorly and struggle with long-term credit assignment in sparse-reward, long-horizon decision-making tasks. To address this, we propose GLIDER, an offline hierarchical reinforcement learning framework that adds a parameter-efficient, task-agnostic hierarchy to LLM policies. A high-level planner generates abstract, temporally extended plans, while a low-level executor carries out each sub-task via chain-of-thought reasoning, which makes task decomposition explicit and provides temporal abstraction. GLIDER integrates offline hierarchical RL, chain-of-thought supervision, abstract plan distillation, and low-level skill transfer, and it supports rapid online adaptation to non-stationary environments. On ScienceWorld and ALFWorld, GLIDER shows consistent performance gains on long-horizon tasks and improved cross-task generalization.

📝 Abstract

While showing sophisticated reasoning abilities, large language models (LLMs) still struggle with long-horizon decision-making tasks due to deficient exploration and long-term credit assignment, especially in sparse-reward scenarios. Inspired by the divide-and-conquer principle, we propose an innovative framework **GLIDER** (**G**rounding **L**anguage Models as Eff**I**cient **D**ecision-Making Agents via Offline Hi**E**rarchical **R**einforcement Learning) that introduces a parameter-efficient and generally applicable hierarchy to LLM policies. We develop a scheme where the low-level controller is supervised with abstract, step-by-step plans that are learned and instructed by the high-level policy. This design decomposes complicated problems into a series of coherent chain-of-thought reasoning sub-tasks, providing flexible temporal abstraction to significantly enhance exploration and learning for long-horizon tasks. Furthermore, GLIDER facilitates fast online adaptation to non-stationary environments owing to the strong transferability of its task-agnostic low-level skills. Experiments on ScienceWorld and ALFWorld benchmarks show that GLIDER achieves consistent performance gains, along with enhanced generalization capabilities.
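The two-level control flow described in the abstract can be sketched as a nested loop. This is a minimal toy illustration, not the GLIDER implementation: `ToyEnv`, `high_level_policy`, `low_level_policy`, `SKILLS`, and `run_episode` are hypothetical names, and both policies are hand-written stubs standing in for LLM calls. It only shows how a high-level policy can hand sub-goals to a low-level controller that emits primitive actions under a sparse, end-of-episode reward.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ToyEnv:
    """Hypothetical text environment: succeeds once all actions are done in order."""
    required: Tuple[str, ...] = ("open fridge", "take apple", "go to table", "put apple")
    done_actions: List[str] = field(default_factory=list)

    def step(self, action: str) -> Tuple[str, float, bool]:
        # Only the next required action makes progress; anything else is a no-op.
        if action == self.required[len(self.done_actions)]:
            self.done_actions.append(action)
        finished = len(self.done_actions) == len(self.required)
        reward = 1.0 if finished else 0.0  # sparse reward: paid only at completion
        return f"did: {action}", reward, finished


# Stub "skills": each sub-goal maps to a fixed action sequence (assumption for the toy).
SKILLS = {
    "get apple": ["open fridge", "take apple"],
    "deliver apple": ["go to table", "put apple"],
}


def high_level_policy(task: str, completed: List[str]) -> Optional[str]:
    """Stub planner: returns the next unfinished sub-goal, or None when the plan is done."""
    plan = ["get apple", "deliver apple"]
    remaining = [goal for goal in plan if goal not in completed]
    return remaining[0] if remaining else None


def low_level_policy(subgoal: str, steps_taken: int) -> Optional[str]:
    """Stub executor: returns the next primitive action, or None when the sub-task ends."""
    actions = SKILLS[subgoal]
    return actions[steps_taken] if steps_taken < len(actions) else None


def run_episode(env: ToyEnv, task: str, max_steps: int = 20):
    completed: List[str] = []
    trace: List[Tuple[str, str]] = []  # (sub-goal, primitive action) pairs
    total_reward, steps = 0.0, 0
    while steps < max_steps:
        subgoal = high_level_policy(task, completed)  # one high-level decision
        if subgoal is None:
            break
        k = 0
        while steps < max_steps:  # low-level controller acts until the sub-task ends
            action = low_level_policy(subgoal, k)
            if action is None:
                break
            _, reward, finished = env.step(action)
            trace.append((subgoal, action))
            total_reward += reward
            k += 1
            steps += 1
            if finished:
                return total_reward, trace
        completed.append(subgoal)
    return total_reward, trace


total, trace = run_episode(ToyEnv(), "put an apple on the table")
print(total, len(trace))
```

In this toy, the high-level policy makes two decisions while the low-level controller takes four primitive steps, which is the sense in which sub-goals act as temporal abstraction.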

Problem

Research questions and friction points this paper is trying to address.

LLMs struggle with long-horizon decision-making tasks

How to add an efficient hierarchy to LLM policies with offline reinforcement learning

Exploration and long-term credit assignment are deficient in sparse-reward scenarios
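The credit-assignment difficulty listed above can be illustrated with the standard discounted return. This is a textbook illustration rather than a formula taken from the paper; the symbols $\gamma$ (discount factor), $T$ (episode length), $R$ (terminal reward), and $c$ (primitive steps per sub-task) are assumptions introduced here. With a reward that arrives only at the final step,

$$
G_t = \sum_{k=t}^{T} \gamma^{\,k-t} r_k = \gamma^{\,T-t} R
\qquad \text{when } r_k = 0 \text{ for } k < T \text{ and } r_T = R ,
$$

so the learning signal at early steps shrinks geometrically with the distance $T - t$ to the reward. If a high-level policy instead chooses one sub-task every $c$ primitive steps, it makes roughly $\lceil T / c \rceil$ decisions per episode, and the terminal reward is at most that many decisions away from any of its choices.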

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical reinforcement learning for LLM policies

Decomposing tasks into coherent sub-tasks

Fast online adaptation with task-agnostic skills

🔎 Similar Papers

Efficient Sequential Decision Making with Large Language Models