Learning When to Plan: Efficiently Allocating Test-Time Compute for LLM Agents

📅 2025-09-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language model (LLM)-based agents suffer from rigid planning strategies in long-horizon tasks: planning before every action incurs excessive computational overhead and degrades performance, while never planning caps capability. Method: We propose the first dynamic planning framework that lets agents autonomously decide, during inference, when to invoke a planning module, adaptively allocating test-time compute. Training proceeds in two stages: supervised fine-tuning on diverse synthetic data, followed by reinforcement learning in long-horizon environments to refine the dynamic planning policy. Contribution/Results: This work introduces the first LLM-based agent capable of adaptive, fine-grained test-time compute scheduling in sequential decision-making, with behavior that can be steered by human instructions. Evaluated in the Crafter environment, our method significantly improves sample efficiency and task completion rate, and it surpasses its standalone performance when augmented with human-provided plans.

📝 Abstract
Training large language models (LLMs) to reason via reinforcement learning (RL) significantly improves their problem-solving capabilities. In agentic settings, existing methods like ReAct prompt LLMs to explicitly plan before every action; however, we demonstrate that always planning is computationally expensive and degrades performance on long-horizon tasks, while never planning further limits performance. To address this, we introduce a conceptual framework formalizing dynamic planning for LLM agents, enabling them to flexibly decide when to allocate test-time compute for planning. We propose a simple two-stage training pipeline: (1) supervised fine-tuning on diverse synthetic data to prime models for dynamic planning, and (2) RL to refine this capability in long-horizon environments. Experiments on the Crafter environment show that dynamic planning agents trained with this approach are more sample-efficient and consistently achieve more complex objectives. Additionally, we demonstrate that these agents can be effectively steered by human-written plans, surpassing their independent capabilities. To our knowledge, this work is the first to explore training LLM agents for dynamic test-time compute allocation in sequential decision-making tasks, paving the way for more efficient, adaptive, and controllable agentic systems.
Problem

Research questions and friction points this paper is trying to address.

Optimizing test-time compute allocation for LLM agents
Balancing planning costs with performance in long-horizon tasks
Enabling dynamic decision-making for planning in sequential environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic planning framework for LLM agents
Two-stage training with SFT and RL
Flexible test-time compute allocation strategy
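The core idea, deciding per step whether to spend extra test-time compute on planning, can be sketched as a simple loop. This is a hypothetical illustration, not the paper's implementation: the names `should_plan` and `run_episode` are invented, and the learned planning policy is replaced by a heuristic (plan when the agent's uncertainty is high or the last plan is stale).

```python
import random


def should_plan(step: int, last_plan_step: int, uncertainty: float,
                threshold: float = 0.7, max_gap: int = 5) -> bool:
    """Decide whether to invoke the (expensive) planning module this step.

    In the paper this decision is made by a trained policy; here a
    stand-in heuristic triggers planning on high uncertainty or when
    the last plan has gone stale.
    """
    return uncertainty > threshold or (step - last_plan_step) >= max_gap


def run_episode(horizon: int = 20, seed: int = 0) -> int:
    """Run one episode and count how often planning was invoked."""
    rng = random.Random(seed)
    last_plan_step = -10**9  # force a plan on the first step
    plan_calls = 0
    for step in range(horizon):
        uncertainty = rng.random()  # stand-in for the agent's own estimate
        if should_plan(step, last_plan_step, uncertainty):
            plan_calls += 1  # expensive: generate/refresh the plan
            last_plan_step = step
        # ...execute the next environment action here...
    return plan_calls


if __name__ == "__main__":
    print(run_episode())
```

Under this heuristic the agent plans at least once every `max_gap` steps but well short of every step, which is the compute/performance trade-off the paper's learned policy optimizes.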