🤖 AI Summary
To address the insufficient coordination between high-level planning and fine-grained manipulation in long-horizon vision-language-action (VLA) tasks, this paper proposes a unified multimodal framework. First, it introduces the "multimodal manual"—a novel intermediate representation that maps goal states to executable steps—and designs a manual-guided chain-of-thought reasoning mechanism in which each manual step supplies explicit control conditions while its latent representation provides implicit guidance. Second, it builds a high-fidelity digital-twin toolkit based on 3D Gaussian Splatting to enable automated, high-quality synthetic data generation. Third, it adopts a Mixture-of-Transformers (MoT) architecture to jointly model the visual, linguistic, and action modalities. Evaluated on LEGO assembly and object rearrangement tasks, the method achieves an average success rate 32% higher in real-world settings than the best hierarchical baseline, significantly improving end-to-end executability and generalization for long-horizon VLA tasks.
📝 Abstract
Vision-Language-Action (VLA) models have recently emerged, demonstrating strong generalization in robotic scene understanding and manipulation. However, when confronted with long-horizon tasks that specify explicit goal states, such as LEGO assembly or object rearrangement, existing VLA models still struggle to coordinate high-level planning with precise manipulation. We therefore aim to endow a VLA model with the capability to infer the "how" process from the "what" outcomes, transforming goal states into executable procedures. In this paper, we introduce ManualVLA, a unified VLA framework built upon a Mixture-of-Transformers (MoT) architecture, enabling coherent collaboration between multimodal manual generation and action execution. Unlike prior VLA models that directly map sensory inputs to actions, we first equip ManualVLA with a planning expert that generates intermediate manuals consisting of images, position prompts, and textual instructions. Building upon these multimodal manuals, we design a Manual Chain-of-Thought (ManualCoT) reasoning process that feeds them into the action expert, where each manual step provides explicit control conditions, while its latent representation offers implicit guidance for accurate manipulation. To alleviate the burden of data collection, we develop a high-fidelity digital-twin toolkit based on 3D Gaussian Splatting, which automatically generates manual data for planning-expert training. ManualVLA demonstrates strong real-world performance, achieving an average success rate 32% higher than the previous hierarchical SOTA baseline on LEGO assembly and object rearrangement tasks.
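The plan-then-act pipeline described above can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the names `ManualStep`, `plan_manual`, and `execute_manual` are hypothetical, and the stubbed planner stands in for the MoT planning expert that would actually generate manual images, position prompts, and instructions.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ManualStep:
    """One step of a multimodal manual: an image reference, an explicit
    position prompt, and a textual instruction (names are illustrative)."""
    image_ref: str
    position: Tuple[float, float]   # explicit control condition, e.g. normalized (x, y)
    instruction: str

def plan_manual(goal_state: str) -> List[ManualStep]:
    """Stand-in for the planning expert: map a goal state to an ordered
    manual. A real planner would generate these steps, not hard-code them."""
    return [
        ManualStep("step_01.png", (0.2, 0.5), f"Place the base brick for {goal_state}"),
        ManualStep("step_02.png", (0.2, 0.6), "Stack the second brick on the base"),
    ]

def execute_manual(manual: List[ManualStep]) -> List[str]:
    """Stand-in for the action expert's ManualCoT loop: each action is
    conditioned on the current step's explicit prompts (in the real model,
    the step's latent embedding would also guide the policy)."""
    trace = []
    for i, step in enumerate(manual, start=1):
        trace.append(f"step {i}: move toward {step.position} | {step.instruction}")
    return trace

trace = execute_manual(plan_manual("a 2-brick LEGO tower"))
```

The point of the intermediate manual is that the action expert never has to infer the full "how" from the goal image alone; it conditions on one concrete, verifiable step at a time.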