ManualVLA: A Unified VLA Model for Chain-of-Thought Manual Generation and Robotic Manipulation

📅 2025-12-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address insufficient coordination between high-level planning and fine-grained manipulation in long-horizon vision-language-action (VLA) tasks, this paper proposes a unified multimodal framework. First, it introduces the “multimodal manual”—a novel intermediate representation mapping goal states to executable steps—and designs a manual-guided chain-of-thought reasoning mechanism that jointly incorporates explicit control constraints and implicit representations. Second, it constructs a high-fidelity digital twin toolkit based on 3D Gaussian splatting to enable automated, high-quality synthetic data generation. Third, it adopts a Mixture-of-Transformers (MoT) architecture to jointly model visual, linguistic, and action modalities. Evaluated on LEGO assembly and object rearrangement tasks, the method achieves a 32% higher average success rate in real-world settings compared to the best hierarchical baseline, significantly improving end-to-end executability and generalization for long-horizon VLA tasks.

📝 Abstract
Vision-Language-Action (VLA) models have recently emerged, demonstrating strong generalization in robotic scene understanding and manipulation. However, when confronted with long-horizon tasks that require defined goal states, such as LEGO assembly or object rearrangement, existing VLA models still face challenges in coordinating high-level planning with precise manipulation. Therefore, we aim to endow a VLA model with the capability to infer the "how" process from the "what" outcomes, transforming goal states into executable procedures. In this paper, we introduce ManualVLA, a unified VLA framework built upon a Mixture-of-Transformers (MoT) architecture, enabling coherent collaboration between multimodal manual generation and action execution. Unlike prior VLA models that directly map sensory inputs to actions, we first equip ManualVLA with a planning expert that generates intermediate manuals consisting of images, position prompts, and textual instructions. Building upon these multimodal manuals, we design a Manual Chain-of-Thought (ManualCoT) reasoning process that feeds them into the action expert, where each manual step provides explicit control conditions, while its latent representation offers implicit guidance for accurate manipulation. To alleviate the burden of data collection, we develop a high-fidelity digital-twin toolkit based on 3D Gaussian Splatting, which automatically generates manual data for planning expert training. ManualVLA demonstrates strong real-world performance, achieving an average success rate 32% higher than the previous hierarchical SOTA baseline on LEGO assembly and object rearrangement tasks.
Problem

Research questions and friction points this paper is trying to address.

Existing VLA models struggle to coordinate high-level planning with fine-grained manipulation in long-horizon tasks
Goal-conditioned tasks such as LEGO assembly require inferring the "how" process from the "what" outcome, i.e., turning a goal state into executable steps
Collecting manual-style planning data for training is costly and hard to scale
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates multimodal manuals combining images, position prompts, and textual instructions
Uses Manual Chain-of-Thought (ManualCoT) reasoning, where each manual step provides explicit control conditions and its latent representation provides implicit guidance
Unifies the planning expert and action expert in a single Mixture-of-Transformers (MoT) architecture
Trains the planning expert on data generated automatically by a 3D Gaussian Splatting digital-twin toolkit
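The planning-then-execution flow described above can be sketched in a few lines. This is a minimal, hypothetical illustration of the ManualCoT idea, not the authors' code: all names, data structures, and the toy planner/executor below are assumptions for exposition.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ManualStep:
    # One entry of the multimodal manual: per the paper, each step pairs
    # a goal image, a position prompt, and a textual instruction. The
    # "image" is a placeholder filename here for illustration.
    image: str
    position: Tuple[int, int]
    instruction: str

def planning_expert(goal_blocks: List[Tuple[int, int]]) -> List[ManualStep]:
    # Toy stand-in for the planning expert: map a goal layout (block
    # positions) to an ordered manual, one placement step per block.
    return [
        ManualStep(image=f"step_{i}.png", position=pos,
                   instruction=f"place block {i} at {pos}")
        for i, pos in enumerate(goal_blocks)
    ]

def action_expert(manual: List[ManualStep]) -> List[str]:
    # ManualCoT-style execution loop: each manual step supplies the
    # explicit condition (here, the position prompt) that the low-level
    # policy acts on. In the real model, the step's latent representation
    # would additionally guide the action decoder.
    actions = []
    for step in manual:
        actions.append(f"move_to{step.position}; attach")
    return actions

manual = planning_expert([(0, 0), (1, 0)])
actions = action_expert(manual)
print(actions)  # one placement action per manual step
```

The point of the intermediate manual is that the action model never has to reason directly from the final goal state; it only needs to satisfy one explicit, grounded sub-goal at a time.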
👥 Authors
Chenyang Gu
Undergraduate, Peking University
Embodied AI, Robotic Manipulation
Jiaming Liu
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Hao Chen
The Chinese University of Hong Kong
Runzhong Huang
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Qingpo Wuwu
Imperial College London | Peking University
Neural Rendering, Physical Simulation, PDEs Solving
Zhuoyang Liu
Peking University
Embodied AI, Computer Vision
Xiaoqi Li
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Ying Li
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Renrui Zhang
Seed ByteDance & MMLab & PKU
Large Multimodal Model, Generative Model, Embodied AI
Peng Jia
Simplexity Robotics
Pheng-Ann Heng
The Chinese University of Hong Kong
Shanghang Zhang
Peking University
Embodied AI, Foundation Models