MLA: A Multisensory Language-Action Model for Multimodal Understanding and Forecasting in Robotic Manipulation

📅 2025-09-30
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Existing vision-language-action (VLA) models neglect critical robotic multimodal sensory inputs—such as tactile signals and 3D point clouds—limiting their capacity to model physical interaction and contact dynamics. To address this, we propose an end-to-end multimodal VLA framework that replaces modality-specific encoders with a large language model (LLM) as a unified perceptual backbone, enabling joint, token-level processing of images, point clouds, and tactile data. We introduce a position-aligned cross-modal fusion mechanism and augment training with generative future multimodal state prediction, explicitly strengthening physical dynamic reasoning. Evaluated on real-world complex contact-intensive tasks, our method achieves 12% and 24% absolute improvements in action accuracy over state-of-the-art 2D- and 3D-based VLA models, respectively, while demonstrating superior generalization to unseen scene configurations.

📝 Abstract
Vision-language-action models (VLAs) have shown generalization capabilities in robotic manipulation tasks by inheriting from vision-language models (VLMs) and learning action generation. Most VLA models focus on interpreting vision and language to generate actions, whereas robots must perceive and interact within the spatial-physical world. This gap highlights the need for a comprehensive understanding of robotic-specific multisensory information, which is crucial for achieving complex and contact-rich control. To this end, we introduce a multisensory language-action (MLA) model that collaboratively perceives heterogeneous sensory modalities and predicts future multisensory objectives to facilitate physical world modeling. Specifically, to enhance perceptual representations, we propose an encoder-free multimodal alignment scheme that innovatively repurposes the large language model itself as a perception module, directly interpreting multimodal cues by aligning 2D images, 3D point clouds, and tactile tokens through positional correspondence. To further enhance MLA's understanding of physical dynamics, we design a future multisensory generation post-training strategy that enables MLA to reason about semantic, geometric, and interaction information, providing more robust conditions for action generation. For evaluation, the MLA model outperforms the previous state-of-the-art 2D and 3D VLA methods by 12% and 24% in complex, contact-rich real-world tasks, respectively, while also demonstrating improved generalization to unseen configurations. Project website: https://sites.google.com/view/open-mla
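The core architectural idea described above is to drop modality-specific encoders and let the language model itself ingest image, point-cloud, and tactile tokens, aligning them through shared positional indices, while auxiliary outputs predict future multisensory states. The snippet below is a minimal sketch of that idea, not the authors' implementation: a small Transformer stands in for the pretrained LLM backbone, and the module names, dimensions, and patching scheme are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of the encoder-free, position-aligned idea:
# 2D image patches, 3D point-cloud patches, and tactile readings are linearly projected
# into the backbone's token space and given shared positional indices, so tokens that
# observe the same spatial location receive the same positional embedding.
import torch
import torch.nn as nn

class MultisensoryTokenizer(nn.Module):
    def __init__(self, d_model=512, img_patch_dim=3*16*16, pc_patch_dim=3*32, tactile_dim=6):
        super().__init__()
        # Lightweight linear projections per modality instead of modality-specific encoders.
        self.img_proj = nn.Linear(img_patch_dim, d_model)
        self.pc_proj = nn.Linear(pc_patch_dim, d_model)
        self.tac_proj = nn.Linear(tactile_dim, d_model)
        self.pos_emb = nn.Embedding(1024, d_model)  # positional table shared across modalities

    def forward(self, img_patches, img_pos, pc_patches, pc_pos, tac_signals, tac_pos):
        # Modalities sharing a positional index get the same positional embedding
        # ("position-aligned" cross-modal fusion).
        img_tok = self.img_proj(img_patches) + self.pos_emb(img_pos)
        pc_tok = self.pc_proj(pc_patches) + self.pos_emb(pc_pos)
        tac_tok = self.tac_proj(tac_signals) + self.pos_emb(tac_pos)
        return torch.cat([img_tok, pc_tok, tac_tok], dim=1)

class MLASketch(nn.Module):
    def __init__(self, d_model=512, action_dim=7):
        super().__init__()
        self.tokenizer = MultisensoryTokenizer(d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)  # stand-in for the pretrained LLM
        self.action_head = nn.Linear(d_model, action_dim)           # action generation
        self.future_head = nn.Linear(d_model, d_model)              # future multisensory prediction

    def forward(self, *obs):
        tokens = self.tokenizer(*obs)
        hidden = self.backbone(tokens)
        return self.action_head(hidden.mean(dim=1)), self.future_head(hidden)

# Smoke test with random observations.
B = 2
model = MLASketch()
action, future_tokens = model(
    torch.randn(B, 196, 3*16*16), torch.randint(0, 1024, (B, 196)),
    torch.randn(B, 64, 3*32), torch.randint(0, 1024, (B, 64)),
    torch.randn(B, 8, 6), torch.randint(0, 1024, (B, 8)),
)
print(action.shape, future_tokens.shape)  # torch.Size([2, 7]) torch.Size([2, 268, 512])
```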
Problem

Research questions and friction points this paper is trying to address.

Enhancing robotic manipulation through multisensory perception and forecasting
Aligning 2D images, 3D point clouds, and tactile tokens collaboratively
Generating future multisensory objectives for robust action planning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses encoder-free multimodal alignment for perception
Repurposes the large language model itself as the perception module
Implements future multisensory generation post-training to model physical dynamics (see the sketch below)
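As a hedged sketch of the last point: the future multisensory generation post-training strategy can be read as an action-generation term plus a reconstruction term over predicted next-step image, point-cloud, and tactile tokens. The loss functions and weights below are assumptions for illustration, not values from the paper.

```python
# Hedged sketch of a combined post-training objective: action prediction plus
# future multisensory (image / point-cloud / tactile) token generation.
import torch
import torch.nn.functional as F

def mla_posttrain_loss(pred_action, gt_action, pred_future, gt_future, w_action=1.0, w_future=0.5):
    # Action generation term (L1 regression here; the paper may use a different objective).
    action_loss = F.l1_loss(pred_action, gt_action)
    # Future multisensory generation term: reconstruct next-step tokens per modality.
    future_loss = sum(F.mse_loss(pred_future[m], gt_future[m]) for m in gt_future) / len(gt_future)
    return w_action * action_loss + w_future * future_loss

# Example with random stand-in tensors.
pred_action, gt_action = torch.randn(2, 7), torch.randn(2, 7)
pred_future = {m: torch.randn(2, 16, 512) for m in ("image", "pointcloud", "tactile")}
gt_future = {m: torch.randn(2, 16, 512) for m in ("image", "pointcloud", "tactile")}
print(mla_posttrain_loss(pred_action, gt_action, pred_future, gt_future).item())
```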
🔎 Similar Papers
No similar papers found.
👥 Authors
Zhuoyang Liu (Peking University): Embodied AI, Computer Vision
Jiaming Liu (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
Jiadong Xu (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
Nuowei Han (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
Chenyang Gu (Undergraduate, Peking University): Embodied AI, Robotic Manipulation
Hao Chen (CUHK)
Kaichen Zhou (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
Renrui Zhang (Seed ByteDance & MMLab & PKU): Large Multimodal Model, Generative Model, Embodied AI
Kai Chin Hsieh (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
Kun Wu (Beijing Innovation Center of Humanoid Robotics)
Zhengping Che (X-Humanoid): Embodied AI, Deep Learning
Jian Tang (Beijing Innovation Center of Humanoid Robotics)
Shanghang Zhang (Peking University): Embodied AI, Foundation Models