MLA: A Multisensory Language-Action Model for Multimodal Understanding and Forecasting in Robotic Manipulation

📅 2025-09-30
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Existing vision-language-action (VLA) models neglect critical robotic multimodal sensory inputs—such as tactile signals and 3D point clouds—limiting their capacity to model physical interaction and contact dynamics. To address this, we propose an end-to-end multimodal VLA framework that replaces modality-specific encoders with a large language model (LLM) as a unified perceptual backbone, enabling joint, token-level processing of images, point clouds, and tactile data. We introduce a position-aligned cross-modal fusion mechanism and augment training with generative future multimodal state prediction, explicitly strengthening physical dynamic reasoning. Evaluated on real-world complex contact-intensive tasks, our method achieves 12% and 24% absolute improvements in action accuracy over state-of-the-art 2D- and 3D-based VLA models, respectively, while demonstrating superior generalization to unseen scene configurations.

📝 Abstract
Vision-language-action models (VLAs) have shown generalization capabilities in robotic manipulation tasks by inheriting from vision-language models (VLMs) and learning action generation. Most VLA models focus on interpreting vision and language to generate actions, whereas robots must perceive and interact within the spatial-physical world. This gap highlights the need for a comprehensive understanding of robotic-specific multisensory information, which is crucial for achieving complex and contact-rich control. To this end, we introduce a multisensory language-action (MLA) model that collaboratively perceives heterogeneous sensory modalities and predicts future multisensory objectives to facilitate physical world modeling. Specifically, to enhance perceptual representations, we propose an encoder-free multimodal alignment scheme that innovatively repurposes the large language model itself as a perception module, directly interpreting multimodal cues by aligning 2D images, 3D point clouds, and tactile tokens through positional correspondence. To further enhance MLA's understanding of physical dynamics, we design a future multisensory generation post-training strategy that enables MLA to reason about semantic, geometric, and interaction information, providing more robust conditions for action generation. For evaluation, the MLA model outperforms the previous state-of-the-art 2D and 3D VLA methods by 12% and 24% in complex, contact-rich real-world tasks, respectively, while also demonstrating improved generalization to unseen configurations. Project website: https://sites.google.com/view/open-mla
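The core architectural idea described above is to drop modality-specific encoders and let the language model itself ingest image, point-cloud, and tactile tokens, aligning them through shared positional indices, while auxiliary outputs predict future multisensory states. The snippet below is a minimal sketch of that idea, not the authors' implementation: a small Transformer stands in for the pretrained LLM backbone, and the module names, dimensions, and patching scheme are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of the encoder-free, position-aligned idea:
# 2D image patches, 3D point-cloud patches, and tactile readings are linearly projected
# into the backbone's token space and given shared positional indices, so tokens that
# observe the same spatial location receive the same positional embedding.
import torch
import torch.nn as nn

class MultisensoryTokenizer(nn.Module):
    def __init__(self, d_model=512, img_patch_dim=3*16*16, pc_patch_dim=3*32, tactile_dim=6):
        super().__init__()
        # Lightweight linear projections per modality instead of modality-specific encoders.
        self.img_proj = nn.Linear(img_patch_dim, d_model)
        self.pc_proj = nn.Linear(pc_patch_dim, d_model)
        self.tac_proj = nn.Linear(tactile_dim, d_model)
        self.pos_emb = nn.Embedding(1024, d_model)  # positional table shared across modalities

    def forward(self, img_patches, img_pos, pc_patches, pc_pos, tac_signals, tac_pos):
        # Modalities sharing a positional index get the same positional embedding
        # ("position-aligned" cross-modal fusion).
        img_tok = self.img_proj(img_patches) + self.pos_emb(img_pos)
        pc_tok = self.pc_proj(pc_patches) + self.pos_emb(pc_pos)
        tac_tok = self.tac_proj(tac_signals) + self.pos_emb(tac_pos)
        return torch.cat([img_tok, pc_tok, tac_tok], dim=1)

class MLASketch(nn.Module):
    def __init__(self, d_model=512, action_dim=7):
        super().__init__()
        self.tokenizer = MultisensoryTokenizer(d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)  # stand-in for the pretrained LLM
        self.action_head = nn.Linear(d_model, action_dim)           # action generation
        self.future_head = nn.Linear(d_model, d_model)              # future multisensory prediction

    def forward(self, *obs):
        tokens = self.tokenizer(*obs)
        hidden = self.backbone(tokens)
        return self.action_head(hidden.mean(dim=1)), self.future_head(hidden)

# Smoke test with random observations.
B = 2
model = MLASketch()
action, future_tokens = model(
    torch.randn(B, 196, 3*16*16), torch.randint(0, 1024, (B, 196)),
    torch.randn(B, 64, 3*32), torch.randint(0, 1024, (B, 64)),
    torch.randn(B, 8, 6), torch.randint(0, 1024, (B, 8)),
)
print(action.shape, future_tokens.shape)  # torch.Size([2, 7]) torch.Size([2, 268, 512])
```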
Problem

Research questions and friction points this paper is trying to address.

Enhancing robotic manipulation through multisensory perception and forecasting
Aligning 2D images, 3D point clouds, and tactile tokens collaboratively
Generating future multisensory objectives for robust action planning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses encoder-free multimodal alignment for perception
Repurposes the large language model itself as the perception module
Implements future multisensory generation post-training to model physical dynamics (see the sketch below)
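As a hedged sketch of the last point: the future multisensory generation post-training strategy can be read as an action-generation term plus a reconstruction term over predicted next-step image, point-cloud, and tactile tokens. The loss functions and weights below are assumptions for illustration, not values from the paper.

```python
# Hedged sketch of a combined post-training objective: action prediction plus
# future multisensory (image / point-cloud / tactile) token generation.
import torch
import torch.nn.functional as F

def mla_posttrain_loss(pred_action, gt_action, pred_future, gt_future, w_action=1.0, w_future=0.5):
    # Action generation term (L1 regression here; the paper may use a different objective).
    action_loss = F.l1_loss(pred_action, gt_action)
    # Future multisensory generation term: reconstruct next-step tokens per modality.
    future_loss = sum(F.mse_loss(pred_future[m], gt_future[m]) for m in gt_future) / len(gt_future)
    return w_action * action_loss + w_future * future_loss

# Example with random stand-in tensors.
pred_action, gt_action = torch.randn(2, 7), torch.randn(2, 7)
pred_future = {m: torch.randn(2, 16, 512) for m in ("image", "pointcloud", "tactile")}
gt_future = {m: torch.randn(2, 16, 512) for m in ("image", "pointcloud", "tactile")}
print(mla_posttrain_loss(pred_action, gt_action, pred_future, gt_future).item())
```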
🔎 Similar Papers
No similar papers found.
👥 Authors
Zhuoyang Liu (Peking University): Embodied AI, Computer Vision
Jiaming Liu (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
Jiadong Xu (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
Nuowei Han (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
Chenyang Gu (Undergraduate, Peking University): Embodied AI, Robotic Manipulation
Hao Chen (CUHK)
Kaichen Zhou (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
Renrui Zhang (Seed ByteDance & MMLab & PKU): Large Multimodal Model, Generative Model, Embodied AI
Kai Chin Hsieh (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
Kun Wu (Beijing Innovation Center of Humanoid Robotics)
Zhengping Che (X-Humanoid): Embodied AI, Deep Learning
Jian Tang (Beijing Innovation Center of Humanoid Robotics)
Shanghang Zhang (Peking University): Embodied AI, Foundation Models