🤖 AI Summary
Existing vision-language-action (VLA) models rely solely on RGB vision, limiting their perception of the physical world. This work proposes a multimodal spatial intelligence framework that integrates infrared, millimeter-wave radar, and microphone-array signals into a unified sensor-masked image representation, which preserves RGB statistical properties while enabling cross-hardware interoperability and efficient transfer learning. A lightweight per-sensor projector aligns these heterogeneous inputs to a pre-trained RGB-based VLA backbone, yielding end-to-end multimodal VLA modeling. Evaluated on real-world manipulation tasks, the approach achieves an average success rate of 84%, outperforming RGB-only and raw-sensor baselines by 59% and 28%, respectively, and demonstrating substantial gains in generalization and data efficiency. Core contributions: (i) the first sensor-masked image representation paradigm for VLAs, and (ii) a low-cost, hardware-agnostic, and highly generalizable multimodal VLA architecture.
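To make the sensor-masked image idea concrete, here is a minimal sketch of overlaying a spatially grounded sensor mask onto an RGB frame so the result stays close to RGB statistics. The summary does not specify the exact mask construction; the function name, channel encoding, and blending weight below are illustrative assumptions, not the paper's method.

```python
import numpy as np

def sensor_masked_image(rgb: np.ndarray, sensor_mask: np.ndarray,
                        alpha: float = 0.5) -> np.ndarray:
    """Overlay a spatially grounded sensor mask onto an RGB frame (sketch).

    rgb:         (H, W, 3) uint8 camera image.
    sensor_mask: (H, W) float in [0, 1], e.g. radar returns or infrared
                 intensity already projected into the camera's pixel frame.
    Returns an (H, W, 3) uint8 image; pixels without sensor response are
    left untouched, keeping the input close to RGB statistics so an
    RGB-pretrained VLA backbone can consume it with little adaptation.
    """
    # Hypothetical encoding: put the sensor signal in the red channel.
    heat = np.zeros_like(rgb, dtype=np.float32)
    heat[..., 0] = sensor_mask * 255.0

    # Alpha-blend only where the sensor responds.
    w = alpha * sensor_mask[..., None]
    blended = (1.0 - w) * rgb.astype(np.float32) + w * heat
    return blended.astype(np.uint8)
```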
📝 Abstract
Vision-language-action (VLA) models have shown strong generalization for action prediction through large-scale vision-language pretraining. However, most existing models rely solely on RGB cameras, limiting their perception and, consequently, their manipulation capabilities. We present OmniVLA, an omni-modality VLA model that integrates novel sensing modalities for physically grounded spatial intelligence beyond RGB perception. The core of our approach is the sensor-masked image, a unified representation that overlays spatially grounded and physically meaningful masks, derived from sensors including an infrared camera, a mmWave radar, and a microphone array, onto the RGB images. This image-native unification keeps sensor input close to RGB statistics to facilitate training, provides a uniform interface across sensor hardware, and enables data-efficient learning with lightweight per-sensor projectors. Built on this representation, we present a multisensory vision-language-action model architecture and train it from an RGB-pretrained VLA backbone. We evaluate OmniVLA on challenging real-world tasks where sensor-modality perception is needed to guide the manipulation. OmniVLA achieves an average task success rate of 84% and significantly outperforms RGB-only and raw-sensor-input baseline models by 59% and 28%, respectively, while showing higher learning efficiency and stronger generalization capability.
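The abstract's lightweight per-sensor projectors can be sketched as small MLPs that map each modality's features into the RGB backbone's embedding space. A minimal PyTorch sketch follows; the module structure, dimensions, and modality names are assumptions for illustration, since the paper's exact projector architecture is not given here.

```python
import torch
import torch.nn as nn

class SensorProjector(nn.Module):
    """Maps per-sensor features into the RGB backbone's embedding space (sketch)."""

    def __init__(self, in_dim: int, embed_dim: int, hidden: int = 512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, in_dim) sensor features
        # returns: (batch, tokens, embed_dim), ready to join RGB patch tokens
        return self.proj(x)

# One small projector per modality; the RGB-pretrained VLA backbone is shared.
# Feature and embedding sizes below are hypothetical.
projectors = nn.ModuleDict({
    "infrared": SensorProjector(in_dim=256, embed_dim=1024),
    "mmwave":   SensorProjector(in_dim=128, embed_dim=1024),
    "audio":    SensorProjector(in_dim=64,  embed_dim=1024),
})
```

Keeping the projectors this small is consistent with the data-efficiency claim: only a few million parameters per modality need to be trained, while the pretrained backbone is reused.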