🤖 AI Summary
Current vision-language-action (VLA) models rely predominantly on end-to-end input-to-action mapping; they lack explicit multi-step embodied reasoning and therefore struggle with complex, dynamic tasks. To address this, the paper proposes a dual-system collaborative framework: an upper-level planner uses action-aligned visual reward signals to guide a multimodal large language model in generating interpretable, stepwise embodied reasoning plans, which are then compressed into compact visual plan latents; a lower-level conditional action network translates these latents into robust, low-level motor commands. The design integrates multimodal large language modeling, reinforcement learning, and visual latent representation learning, decoupling high-level planning from low-level control while keeping the two coordinated. Experiments demonstrate state-of-the-art performance on embodied reasoning and robotic manipulation benchmarks, with few-shot adaptation, long-horizon planning, and dynamic self-correction that substantially improve generalization and fault tolerance in complex real-world tasks.
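The dual-system data flow described above (slow reasoning planner → compact plan latent → fast action network) can be sketched as follows. This is a minimal conceptual illustration, not the paper's implementation: all dimensions, the mean-pool compression, and the random projection weights are placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- illustrative only, not from the paper.
PLAN_TOKENS, TOKEN_DIM, LATENT_DIM, ACTION_DIM = 8, 32, 16, 7

def reasoning_planner(observation, instruction):
    """Stand-in for the multimodal LLM planner: emits a sequence of
    stepwise plan tokens (random here, learned in the real system)."""
    return rng.standard_normal((PLAN_TOKENS, TOKEN_DIM))

def compress_plan(plan_tokens, W):
    """Compress the stepwise reasoning plan into a compact visual plan
    latent (mean-pool + linear projection as a placeholder)."""
    return plan_tokens.mean(axis=0) @ W

def action_model(latent, observation, W):
    """Lower-level policy conditioned on the plan latent; tanh keeps the
    motor command bounded."""
    return np.tanh(latent @ W)

# Random projections stand in for learned parameters.
W_compress = rng.standard_normal((TOKEN_DIM, LATENT_DIM))
W_action = rng.standard_normal((LATENT_DIM, ACTION_DIM))

obs, instr = None, "put the cup on the shelf"
plan = reasoning_planner(obs, instr)          # slow system: explicit reasoning
latent = compress_plan(plan, W_compress)      # compact visual plan latent
action = action_model(latent, obs, W_action)  # fast system: motor command
print(action.shape)
```

The key design point the sketch captures is that the action network never sees the raw reasoning text, only the compressed latent, which is what decouples planning frequency from control frequency.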
📝 Abstract
Vision-language-action (VLA) reasoning tasks require agents to interpret multimodal instructions, perform long-horizon planning, and act adaptively in dynamic environments. Existing approaches typically train VLA models in an end-to-end fashion, directly mapping inputs to actions without explicit reasoning, which hinders their ability to plan over multiple steps or adapt to complex task variations. In this paper, we propose ThinkAct, a dual-system framework that bridges high-level reasoning with low-level action execution via reinforced visual latent planning. ThinkAct trains a multimodal LLM to generate embodied reasoning plans guided by reinforcing action-aligned visual rewards based on goal completion and trajectory consistency. These reasoning plans are compressed into a visual plan latent that conditions a downstream action model for robust action execution on target environments. Extensive experiments on embodied reasoning and robot manipulation benchmarks demonstrate that ThinkAct enables few-shot adaptation, long-horizon planning, and self-correction behaviors in complex embodied AI tasks.
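The abstract's action-aligned visual reward combines a goal-completion term with a trajectory-consistency term. A hedged sketch of one plausible form is below; the equal weighting, the Euclidean distance metric, and the exponential squashing are all assumptions for illustration, not the paper's definition.

```python
import numpy as np

def action_aligned_reward(pred_traj, ref_traj, goal_reached,
                          w_goal=0.5, w_traj=0.5):
    """Hypothetical action-aligned visual reward: a binary goal-completion
    term plus a trajectory-consistency term, where consistency is the
    mean distance to a reference trajectory squashed into (0, 1]."""
    dist = np.mean(np.linalg.norm(pred_traj - ref_traj, axis=-1))
    consistency = np.exp(-dist)
    return w_goal * float(goal_reached) + w_traj * consistency

# A plan that tracks the reference and reaches the goal should score
# higher than one that drifts and fails.
ref = np.zeros((5, 2))
close = ref + 0.01
far = ref + 2.0

r_close = action_aligned_reward(close, ref, goal_reached=True)
r_far = action_aligned_reward(far, ref, goal_reached=False)
print(r_close > r_far)  # True
```

Such a dense shaped signal lets reinforcement learning reward reasoning plans that align with feasible visual trajectories, rather than relying on sparse task success alone.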