🤖 AI Summary
Large language models (LLMs) struggle with long-horizon decision-making in sparse-reward settings because they explore poorly and have difficulty assigning credit over long horizons. To address this, we propose GLIDER, an offline hierarchical reinforcement learning framework that adds a parameter-efficient, generally applicable hierarchy to LLM policies: a high-level policy produces abstract, step-by-step plans, and a low-level controller, supervised with those plans, carries out each sub-task through chain-of-thought reasoning. This decomposition provides flexible temporal abstraction, which improves exploration and learning on long-horizon tasks. Because the low-level skills are task-agnostic and transfer well, GLIDER also supports fast online adaptation to non-stationary environments. On the ScienceWorld and ALFWorld benchmarks, GLIDER achieves consistent performance gains and better generalization. The work offers a way to structure LLM policies for complex sequential decision-making.
📝 Abstract
While showing sophisticated reasoning abilities, large language models (LLMs) still struggle with long-horizon decision-making tasks due to deficient exploration and long-term credit assignment, especially in sparse-reward scenarios. Inspired by the divide-and-conquer principle, we propose an innovative framework **GLIDER** (**G**rounding **L**anguage Models as Eff**I**cient **D**ecision-Making Agents via Offline Hi**E**rarchical **R**einforcement Learning) that introduces a parameter-efficient and generally applicable hierarchy to LLM policies. We develop a scheme where the low-level controller is supervised with abstract, step-by-step plans that are learned and instructed by the high-level policy. This design decomposes complicated problems into a series of coherent chain-of-thought reasoning sub-tasks, providing flexible temporal abstraction to significantly enhance exploration and learning for long-horizon tasks. Furthermore, GLIDER facilitates fast online adaptation to non-stationary environments owing to the strong transferability of its task-agnostic low-level skills. Experiments on ScienceWorld and ALFWorld benchmarks show that GLIDER achieves consistent performance gains, along with enhanced generalization capabilities.
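To make the two-level structure concrete, here is a minimal Python sketch of a hierarchical control loop. It is an illustration under stated assumptions, not GLIDER's implementation: the function `run_hierarchical_episode`, the `ToyEnv` class, the rule-based stand-ins for the high-level and low-level policies, and the sub-goal termination check are all hypothetical, and no LLM or offline RL training is involved. It only shows how a high-level policy can issue sub-goals that a low-level policy turns into primitive actions in a sparse-reward task.

```python
from typing import Callable, List, Tuple

# All names below are hypothetical and exist only for this illustration.
Step = Tuple[str, float, bool]  # (observation, reward, done)


class ToyEnv:
    """Toy sparse-reward task: reward 1.0 only after both actions occur in order."""

    def __init__(self) -> None:
        self.required = ["take kettle", "heat kettle"]
        self.progress = 0

    def step(self, action: str) -> Step:
        # Advance only when the action matches the next required one.
        if self.progress < len(self.required) and action == self.required[self.progress]:
            self.progress += 1
        done = self.progress == len(self.required)
        return f"progress={self.progress}", (1.0 if done else 0.0), done


def high_level_policy(task: str, obs: str) -> str:
    """Stand-in for the planner: maps the current observation to a sub-goal."""
    return "get the kettle" if obs == "progress=0" else "heat the kettle"


def low_level_policy(sub_goal: str, obs: str) -> str:
    """Stand-in for the executor: maps a sub-goal to a primitive action."""
    return {"get the kettle": "take kettle", "heat the kettle": "heat kettle"}[sub_goal]


def subgoal_done(sub_goal: str, obs: str) -> bool:
    """Assumed termination check for the only sub-goal that can finish early."""
    return sub_goal == "get the kettle" and obs != "progress=0"


def run_hierarchical_episode(
    task: str,
    first_obs: str,
    env_step: Callable[[str], Step],
    max_subgoals: int = 5,
    max_steps_per_subgoal: int = 4,
) -> Tuple[List[Tuple[str, str]], float]:
    """Outer loop picks sub-goals; inner loop acts until the sub-goal ends."""
    obs, total, trace = first_obs, 0.0, []
    for _ in range(max_subgoals):
        sub_goal = high_level_policy(task, obs)
        for _ in range(max_steps_per_subgoal):
            action = low_level_policy(sub_goal, obs)
            obs, reward, done = env_step(action)
            trace.append((sub_goal, action))
            total += reward
            if done:
                return trace, total
            if subgoal_done(sub_goal, obs):
                break
    return trace, total


env = ToyEnv()
trace, total = run_hierarchical_episode("boil water", "progress=0", env.step)
print(trace, total)
```

In this sketch the high-level policy makes a decision once per sub-goal rather than once per primitive action, which is the sense in which a hierarchy gives temporal abstraction. The low-level mapping depends only on the sub-goal and not on the overall task, loosely mirroring the idea of task-agnostic low-level skills.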