OmniVLA: Unifying Multi-Sensor Perception for Physically-Grounded Multimodal VLA

📅 2025-11-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language-action (VLA) models rely solely on RGB vision, limiting their physical-world perception capabilities. This work proposes a multimodal spatial intelligence framework that integrates infrared, millimeter-wave radar, and microphone array signals into a unified sensor-masked image representation—preserving RGB statistical properties while enabling cross-hardware interoperability and efficient transfer learning. A lightweight sensor projector aligns heterogeneous sensory inputs to a pre-trained RGB-based VLA backbone, enabling end-to-end multimodal VLA modeling. Evaluated on real-world manipulation tasks, our approach achieves an average success rate of 84%, outperforming RGB-only and raw-sensor baselines by +59% and +28%, respectively, demonstrating substantial gains in generalization and data efficiency. Our core contributions are: (i) the first sensor-masked image representation paradigm for VLAs, and (ii) a low-cost, hardware-agnostic, and highly generalizable multimodal VLA architecture.
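
As a rough illustration of the sensor-masked image idea described above, the sketch below renders three hypothetical sensor readings (an infrared temperature map, radar detections projected into the image plane, and a sound source's azimuth from a microphone array) as spatial masks alpha-blended onto the RGB frame. The function names, thresholds, and projection heuristics are illustrative assumptions, not the paper's implementation; the only point carried over from the summary is that each sensor ends up as an RGB-like image a pretrained backbone can consume.

```python
import numpy as np

def overlay_mask(rgb, mask, color, alpha=0.5):
    """Alpha-blend a soft spatial mask onto an RGB frame.

    Keeping the output an HxWx3 uint8 image preserves RGB-like statistics,
    so the result can be fed to an RGB-pretrained backbone unchanged.
    """
    out = rgb.astype(np.float32)
    m = mask.astype(np.float32)[..., None]               # HxWx1 in [0, 1]
    out = (1 - alpha * m) * out + alpha * m * np.array(color, np.float32)
    return out.clip(0, 255).astype(np.uint8)

def infrared_mask(temperature_map, threshold_c=45.0):
    """Hypothetical rule: mark pixels whose temperature exceeds a threshold."""
    return (temperature_map > threshold_c).astype(np.float32)

def radar_mask(shape, detections_uv, radius=12):
    """Hypothetical rule: paint discs at radar detections projected into the image."""
    h, w = shape
    yy, xx = np.mgrid[0:h, 0:w]
    mask = np.zeros(shape, np.float32)
    for u, v in detections_uv:                            # (column, row) pixel coords
        mask = np.maximum(mask, (xx - u) ** 2 + (yy - v) ** 2 <= radius ** 2)
    return mask

def audio_doa_mask(shape, azimuth_deg, fov_deg=90.0, width_deg=15.0):
    """Hypothetical rule: highlight the column band matching a sound's azimuth."""
    h, w = shape
    center = (azimuth_deg / fov_deg + 0.5) * w            # map azimuth to a column
    band = np.abs(np.arange(w) - center) <= (width_deg / fov_deg) * w / 2
    return np.tile(band.astype(np.float32), (h, 1))

# Example: compose the three sensor-masked views from one RGB frame.
rgb = np.random.randint(0, 256, (224, 224, 3), np.uint8)
ir_view    = overlay_mask(rgb, infrared_mask(np.random.uniform(20, 60, (224, 224))), (255, 0, 0))
radar_view = overlay_mask(rgb, radar_mask((224, 224), [(60, 120), (180, 90)]), (0, 255, 0))
audio_view = overlay_mask(rgb, audio_doa_mask((224, 224), azimuth_deg=20.0), (0, 0, 255))
```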

📝 Abstract
Vision-language-action (VLA) models have shown strong generalization for action prediction through large-scale vision-language pretraining. However, most existing models rely solely on RGB cameras, limiting their perception and, consequently, their manipulation capabilities. We present OmniVLA, an omni-modality VLA model that integrates novel sensing modalities for physically-grounded spatial intelligence beyond RGB perception. The core of our approach is the sensor-masked image, a unified representation that overlays spatially grounded, physically meaningful masks, derived from sensors including an infrared camera, a mmWave radar, and a microphone array, onto the RGB images. This image-native unification keeps sensor input close to RGB statistics to facilitate training, provides a uniform interface across sensor hardware, and enables data-efficient learning with lightweight per-sensor projectors. Building on this representation, we present a multisensory vision-language-action model architecture and train it from an RGB-pretrained VLA backbone. We evaluate OmniVLA on challenging real-world tasks where sensor-modality perception is needed to guide manipulation. OmniVLA achieves an average task success rate of 84%, significantly outperforming both RGB-only and raw-sensor-input baseline models by 59% and 28%, respectively, while showing higher learning efficiency and stronger generalization capability.
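
The abstract's lightweight per-sensor projectors feeding an RGB-pretrained backbone can be pictured with a minimal PyTorch sketch. The module names, layer sizes, and the assumption that one shared vision encoder processes every sensor-masked image are hypothetical choices made for illustration; the paper's actual architecture may differ.

```python
import torch
import torch.nn as nn

class SensorProjector(nn.Module):
    """Lightweight per-sensor projector (a sketch, not the paper's exact layers).

    Maps vision-encoder features of a sensor-masked image into the token space
    the RGB-pretrained VLA backbone already expects, so only these small
    modules need to be trained from scratch.
    """
    def __init__(self, feat_dim: int, token_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.LayerNorm(feat_dim),
            nn.Linear(feat_dim, token_dim),
            nn.GELU(),
            nn.Linear(token_dim, token_dim),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:   # (B, N, feat_dim)
        return self.proj(feats)                                # (B, N, token_dim)


class MultiSensorVLA(nn.Module):
    """Hypothetical wrapper: one shared RGB vision encoder, one projector per sensor."""
    def __init__(self, vision_encoder: nn.Module, backbone: nn.Module,
                 sensors=("rgb", "infrared", "radar", "audio"),
                 feat_dim=768, token_dim=1024):
        super().__init__()
        self.vision_encoder = vision_encoder                   # from the RGB-pretrained VLA
        self.backbone = backbone                               # language/action transformer
        self.projectors = nn.ModuleDict(
            {s: SensorProjector(feat_dim, token_dim) for s in sensors})

    def forward(self, images: dict, text_tokens: torch.Tensor) -> torch.Tensor:
        # images: sensor name -> (B, 3, H, W) sensor-masked image (RGB-like by design)
        tokens = [self.projectors[s](self.vision_encoder(img))
                  for s, img in images.items()]
        visual = torch.cat(tokens, dim=1)                      # concatenate per-sensor tokens
        return self.backbone(visual, text_tokens)              # predicts actions
```

Because each sensor-masked image is already RGB-like, the same pretrained vision encoder can in principle be reused across sensors, and only the small projectors (plus any backbone finetuning) require new multimodal training data, which is consistent with the data-efficiency claim in the abstract.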
Problem

Research questions and friction points this paper is trying to address.

Integrating multiple sensors beyond RGB for robotic perception
Developing a unified representation for multimodal sensor inputs
Enhancing manipulation capabilities through multisensory vision-language-action models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates infrared camera, mmWave radar, and microphone-array sensors
Uses a unified sensor-masked image representation
Builds on an RGB-pretrained VLA backbone architecture
Heyu Guo
Princeton University
Shanmu Wang
University of California, Los Angeles
Ruichun Ma
Microsoft Research Asia
Shiqi Jiang
Microsoft Research Asia
Yasaman Ghasempour
Assistant Professor, Princeton University
mmWave and Terahertz, Wireless Communication, Wireless Sensing, Wireless Security
Omid Abari
University of California, Los Angeles
Baining Guo
Distinguished Scientist, Microsoft Research
Computer Graphics, Virtual Reality, Geometric Modeling
Lili Qi
Microsoft Research Asia