RIG: Synergizing Reasoning and Imagination in End-to-End Generalist Policy

📅 2025-03-31

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the low sample efficiency and poor generalization of end-to-end autonomous agents in open-world settings. It proposes RIG, the first agent framework that jointly models reasoning and world modeling (image generation) within a single end-to-end policy. Its core contributions are: (1) joint optimization of action reasoning and next-frame prediction to explicitly capture the coupling among environment dynamics, actions, and decisions; (2) a progressive trajectory-augmentation data pipeline that enables a closed loop of reasoning, acting, and imagining, with reasoning-driven self-correction; and (3) co-training of latent-space action representations and image generation. Experiments show that RIG achieves a more than 17× improvement in sample efficiency and outperforms prior methods in cross-task generalization, environmental robustness, and multi-task interoperability, while also supporting test-time scaling.

📝 Abstract

Reasoning before action and imagining potential outcomes (i.e., world models) are essential for embodied agents operating in complex open-world environments. Yet prior work either incorporates only one of these abilities in an end-to-end agent or integrates multiple specialized models into an agent system, limiting the learning efficiency and generalization of the policy. Thus, this paper makes the first attempt to synergize Reasoning and Imagination in an end-to-end Generalist policy, termed RIG. To train RIG in an end-to-end manner, we construct a data pipeline that progressively integrates and enriches the content of imagination and reasoning in the trajectories collected from existing agents. The joint learning of reasoning and next-image generation explicitly models the inherent correlation between reasoning, action, and the dynamics of environments, and thus yields a more than $17\times$ improvement in sample efficiency, along with better generalization, compared with previous work. During inference, RIG first reasons about the next action, produces a potential action, and then predicts the outcome of that action, which gives the agent a chance to review and self-correct based on the imagination before taking real actions. Experimental results show that the synergy of reasoning and imagination not only improves the robustness, generalization, and interoperability of the generalist policy but also enables test-time scaling to enhance overall performance.
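
The inference procedure described in the abstract (reason, draft an action, imagine its outcome, review, and only then act) can be illustrated with a minimal Python sketch. Everything named here is hypothetical and not taken from the paper: `act_with_review`, `Draft`, the three callables `propose`, `imagine`, and `accept`, and the `max_reviews` budget are assumptions made for illustration. In RIG a single end-to-end policy plays all three roles; the sketch separates them only so the control flow is visible, and it uses integers as stand-ins for image frames and actions.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical stand-ins: an observation would be an image frame and an
# action a low-level control command in a real agent.
Observation = int
Action = int


@dataclass
class Draft:
    reasoning: str         # textual reasoning produced before the action
    action: Action         # candidate action, not yet executed
    imagined: Observation  # predicted outcome of the candidate action


def act_with_review(
    obs: Observation,
    propose: Callable[[Observation, List[Draft]], Tuple[str, Action]],
    imagine: Callable[[Observation, Action], Observation],
    accept: Callable[[Observation, Observation], bool],
    max_reviews: int = 3,
) -> Tuple[Action, List[Draft]]:
    """Reason, draft an action, imagine its outcome, and revise before acting."""
    if max_reviews < 1:
        raise ValueError("max_reviews must be at least 1")
    drafts: List[Draft] = []
    for _ in range(max_reviews):
        reasoning, action = propose(obs, drafts)  # reason, then draft an action
        predicted = imagine(obs, action)          # imagine the next observation
        drafts.append(Draft(reasoning, action, predicted))
        if accept(obs, predicted):                # review the imagined outcome
            break
    # Only the last draft is executed in the real environment.
    return drafts[-1].action, drafts


# Toy environment: the state is a position on a line and the goal is 5.
GOAL = 5


def propose(obs: Observation, drafts: List[Draft]) -> Tuple[str, Action]:
    # The first draft deliberately steps the wrong way; later drafts reverse it.
    step = -1 if not drafts else -drafts[-1].action
    return f"try step {step:+d}", step


def imagine(obs: Observation, action: Action) -> Observation:
    return obs + action


def accept(obs: Observation, predicted: Observation) -> bool:
    # Accept only if the imagined outcome is closer to the goal.
    return abs(GOAL - predicted) < abs(GOAL - obs)


action, drafts = act_with_review(2, propose, imagine, accept)
```

In this toy run the first draft moves away from the goal, the imagined outcome fails the review, and the second draft is the one returned for execution. Raising `max_reviews` allows more review rounds per step, which is one simple way to picture spending extra computation at test time; how RIG actually realizes test-time scaling is not specified in this summary.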

Problem

Research questions and friction points this paper is trying to address.

Synergizing reasoning and imagination in end-to-end generalist policy

Improving learning efficiency and generalization in agent policies

Enhancing robustness and performance through joint reasoning-imagination modeling

Innovation

Methods, ideas, or system contributions that make the work stand out.

Synergizes reasoning and imagination in end-to-end policy

Joint learning of reasoning and next image generation

Enables test-time scaling for performance enhancement

🔎 Similar Papers

Policy Learning with a Language Bottleneck