🤖 AI Summary
To address the insufficient coordination between high-level planning and fine-grained manipulation in long-horizon vision-language-action (VLA) tasks, this paper proposes a unified multimodal framework. First, it introduces the "multimodal manual"—a novel intermediate representation that maps goal states to executable steps—and designs a manual-guided chain-of-thought reasoning mechanism in which each manual step supplies explicit control conditions while its latent representation provides implicit guidance. Second, it builds a high-fidelity digital-twin toolkit based on 3D Gaussian Splatting to enable automated, high-quality synthetic data generation. Third, it adopts a Mixture-of-Transformers (MoT) architecture to jointly model the visual, linguistic, and action modalities. Evaluated on LEGO assembly and object rearrangement tasks, the method achieves an average success rate 32% higher in real-world settings than the best hierarchical baseline, significantly improving end-to-end executability and generalization for long-horizon VLA tasks.
📝 Abstract
Vision-Language-Action (VLA) models have recently emerged, demonstrating strong generalization in robotic scene understanding and manipulation. However, when confronted with long-horizon tasks that specify explicit goal states, such as LEGO assembly or object rearrangement, existing VLA models still struggle to coordinate high-level planning with precise manipulation. We therefore aim to endow a VLA model with the capability to infer the "how" process from the "what" outcomes, transforming goal states into executable procedures. In this paper, we introduce ManualVLA, a unified VLA framework built upon a Mixture-of-Transformers (MoT) architecture, enabling coherent collaboration between multimodal manual generation and action execution. Unlike prior VLA models that directly map sensory inputs to actions, we first equip ManualVLA with a planning expert that generates intermediate manuals consisting of images, position prompts, and textual instructions. Building upon these multimodal manuals, we design a Manual Chain-of-Thought (ManualCoT) reasoning process that feeds them into the action expert, where each manual step provides explicit control conditions, while its latent representation offers implicit guidance for accurate manipulation. To alleviate the burden of data collection, we develop a high-fidelity digital-twin toolkit based on 3D Gaussian Splatting, which automatically generates manual data for planning-expert training. ManualVLA demonstrates strong real-world performance, achieving an average success rate 32% higher than the previous hierarchical SOTA baseline on LEGO assembly and object rearrangement tasks.
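The plan-then-act pipeline described above can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the names `ManualStep`, `plan_manual`, and `execute_manual` are hypothetical, and the stubbed planner stands in for the MoT planning expert that would actually generate manual images, position prompts, and instructions.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ManualStep:
    """One step of a multimodal manual: an image reference, an explicit
    position prompt, and a textual instruction (names are illustrative)."""
    image_ref: str
    position: Tuple[float, float]   # explicit control condition, e.g. normalized (x, y)
    instruction: str

def plan_manual(goal_state: str) -> List[ManualStep]:
    """Stand-in for the planning expert: map a goal state to an ordered
    manual. A real planner would generate these steps, not hard-code them."""
    return [
        ManualStep("step_01.png", (0.2, 0.5), f"Place the base brick for {goal_state}"),
        ManualStep("step_02.png", (0.2, 0.6), "Stack the second brick on the base"),
    ]

def execute_manual(manual: List[ManualStep]) -> List[str]:
    """Stand-in for the action expert's ManualCoT loop: each action is
    conditioned on the current step's explicit prompts (in the real model,
    the step's latent embedding would also guide the policy)."""
    trace = []
    for i, step in enumerate(manual, start=1):
        trace.append(f"step {i}: move toward {step.position} | {step.instruction}")
    return trace

trace = execute_manual(plan_manual("a 2-brick LEGO tower"))
```

The point of the intermediate manual is that the action expert never has to infer the full "how" from the goal image alone; it conditions on one concrete, verifiable step at a time.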