🤖 AI Summary
Large language models (LLMs) lack explicit planning capabilities for interactive long-horizon reasoning tasks such as tool use, social reasoning, and multi-turn dialogue. Existing online reinforcement learning (RL) fine-tuning approaches suffer from high computational cost and are incompatible with API-closed LLMs, leaving prompt engineering as the dominant but brittle alternative.
Method: We propose a lightweight, search-free, goal-conditioned value-guidance mechanism: (i) value modeling at the *reasoning-step* (not full-action) level; (ii) an offline goal-conditioned RL framework that enables value augmentation for API-restricted models; and (iii) a collaborative LLM–value-function architecture with deterministic planning.
Contribution/Results: Our method significantly outperforms prompt-engineering and online RL baselines across multiple interactive reasoning benchmarks, with minimal inference overhead and strong scalability, and without requiring online search or model-specific API access.
📝 Abstract
Large language models (LLMs) excel at tasks like question answering and dialogue, but complex interactive tasks, such as negotiation and persuasion, additionally require long-horizon reasoning and planning. Reinforcement learning (RL) fine-tuning can enable such planning in principle, but it suffers from drawbacks that hinder scalability. In particular, multi-turn RL training incurs high memory and computational costs, which are exacerbated when training LLMs as policies. Furthermore, the largest LLMs do not expose the APIs necessary to train them in this manner. As a result, modern methods for improving LLM reasoning rely on sophisticated prompting mechanisms rather than RL fine-tuning. To remedy this, we propose a novel approach that uses goal-conditioned value functions to guide the reasoning of LLM agents and that scales even to large API-based models. These value functions predict how a task will unfold given an action, allowing the LLM agent to evaluate multiple possible outcomes, both positive and negative, and plan effectively. In addition, the value functions are trained over reasoning steps rather than full actions, making them a concise, lightweight module that facilitates decision-making in multi-turn interactions. We validate our method on tasks requiring interaction, including tool use, social deduction, and dialogue, demonstrating superior performance over both RL fine-tuning and prompting methods while maintaining efficiency and scalability.
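To make the search-free, value-guided selection concrete, here is a minimal sketch of how a goal-conditioned value function might score candidate reasoning steps and pick one deterministically. All names (`toy_value_fn`, `select_step`) and the word-overlap scoring are illustrative stand-ins, not the paper's actual interfaces or learned model.

```python
# Sketch: value-guided reasoning-step selection with a goal-conditioned
# value function V(step, goal). A real system would use a trained value
# model; here a crude word-overlap score stands in for it.

def toy_value_fn(step: str, goal: str) -> float:
    # Illustrative stand-in for a learned value model: scores a candidate
    # reasoning step by its word overlap with the goal description.
    step_words = set(step.lower().split())
    goal_words = set(goal.lower().split())
    return len(step_words & goal_words) / max(len(goal_words), 1)

def select_step(candidates: list[str], goal: str, value_fn=toy_value_fn) -> str:
    # Deterministic, search-free planning: evaluate each candidate
    # reasoning step once and commit to the highest-valued one,
    # rather than expanding an online search tree.
    return max(candidates, key=lambda step: value_fn(step, goal))

# Hypothetical negotiation example: candidate reasoning steps an LLM
# might propose, scored against a goal description.
candidates = [
    "ask the buyer about their budget",
    "repeat the listing price verbatim",
    "offer a discount to close the sale within budget",
]
goal = "close the sale within the buyer budget"
best = select_step(candidates, goal)
```

The key design choice this mirrors is that the value function evaluates short reasoning steps rather than full multi-turn actions, so a single forward pass per candidate replaces costly rollout-based search.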