ProAct: Agentic Lookahead in Interactive Environments

📅 2026-02-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of large language model (LLM) agents in long-horizon planning tasks, where performance is hindered by the accumulation of simulation errors. The authors propose ProAct, a framework that internalizes precise forward-looking reasoning through a two-stage training process: first, supervised fine-tuning on environment-search-derived trajectories, followed by policy-gradient optimization using a plug-and-play Monte Carlo Critic (MC-Critic) that provides low-variance value estimates. Key innovations include Grounded LookAhead Distillation, which compresses search trees into causal reasoning chains, and a lightweight environment-rollout mechanism with value calibration. Experiments demonstrate that a 4B-parameter ProAct model significantly improves planning accuracy on tasks such as 2048 and Sokoban, outperforming all open-source baselines and matching the performance of leading closed-source models, while exhibiting strong generalization capabilities.

📝 Abstract
Existing Large Language Model (LLM) agents struggle in interactive environments requiring long-horizon planning, primarily due to compounding errors when simulating future states. To address this, we propose ProAct, a framework that enables agents to internalize accurate lookahead reasoning through a two-stage training paradigm. First, we introduce Grounded LookAhead Distillation (GLAD), where the agent undergoes supervised fine-tuning on trajectories derived from environment-based search. By compressing complex search trees into concise, causal reasoning chains, the agent learns the logic of foresight without the computational overhead of inference-time search. Second, to further refine decision accuracy, we propose the Monte-Carlo Critic (MC-Critic), a plug-and-play auxiliary value estimator designed to enhance policy-gradient algorithms like PPO and GRPO. By leveraging lightweight environment rollouts to calibrate value estimates, MC-Critic provides a low-variance signal that facilitates stable policy optimization without relying on expensive model-based value approximation. Experiments on both stochastic (e.g., 2048) and deterministic (e.g., Sokoban) environments demonstrate that ProAct significantly improves planning accuracy. Notably, a 4B-parameter model trained with ProAct outperforms all open-source baselines and rivals state-of-the-art closed-source models, while demonstrating robust generalization to unseen environments. The code and models are available at https://github.com/GreatX3/ProAct.
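The core of the MC-Critic idea, as the abstract describes it, is to replace an expensive learned value model with averaged returns from short environment rollouts. The sketch below illustrates that general Monte Carlo estimation pattern on a toy environment; the `ToyEnv` class, function names, and all hyperparameters are illustrative assumptions, not the paper's actual API or environments.

```python
import random

class ToyEnv:
    """Minimal stochastic chain environment: step right toward a goal,
    with a small chance of slipping back. Purely illustrative."""
    def __init__(self, state=0, goal=5):
        self.state, self.goal = state, goal

    def clone(self):
        # A cheap state copy lets us run many rollouts from the same point.
        return ToyEnv(self.state, self.goal)

    def step(self, action):
        # 90% of the time the action succeeds; 10% of the time it slips.
        self.state += action if random.random() > 0.1 else -action
        done = self.state >= self.goal
        return (1.0 if done else 0.0), done

def mc_value_estimate(env, policy, n_rollouts=32, horizon=20, gamma=0.99):
    """Estimate the value of env's current state as the mean discounted
    return over short policy rollouts in cloned copies of the environment."""
    returns = []
    for _ in range(n_rollouts):
        sim = env.clone()
        total, discount = 0.0, 1.0
        for _ in range(horizon):
            reward, done = sim.step(policy(sim.state))
            total += discount * reward
            discount *= gamma
            if done:
                break
        returns.append(total)
    # Averaging independent rollouts lowers the variance of the estimate,
    # which is what makes it usable as a calibration signal for PPO/GRPO.
    return sum(returns) / len(returns)

random.seed(0)
v = mc_value_estimate(ToyEnv(), policy=lambda s: 1)  # always step right
print(f"estimated value: {v:.3f}")
```

In a policy-gradient loop, an estimate like this could stand in for (or calibrate) the critic's value when computing advantages, trading a few cheap simulator steps for lower-variance targets.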
Problem

Research questions and friction points this paper is trying to address.

long-horizon planning
interactive environments
compounding errors
future state simulation
LLM agents
Innovation

Methods, ideas, or system contributions that make the work stand out.

ProAct
Grounded LookAhead Distillation
Monte-Carlo Critic
long-horizon planning
LLM agents