🤖 AI Summary
Existing vision-language-action (VLA) models rely solely on RGB vision, limiting their perception of the physical world. This work proposes a multimodal spatial intelligence framework that integrates infrared, millimeter-wave radar, and microphone-array signals into a unified sensor-masked image representation, which preserves RGB statistical properties while enabling cross-hardware interoperability and efficient transfer learning. A lightweight per-sensor projector aligns these heterogeneous inputs to a pre-trained RGB-based VLA backbone, yielding end-to-end multimodal VLA modeling. Evaluated on real-world manipulation tasks, the approach achieves an average success rate of 84%, outperforming RGB-only and raw-sensor baselines by 59% and 28%, respectively, and demonstrating substantial gains in generalization and data efficiency. Core contributions: (i) the first sensor-masked image representation paradigm for VLAs, and (ii) a low-cost, hardware-agnostic, and highly generalizable multimodal VLA architecture.
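To make the sensor-masked image idea concrete, here is a minimal sketch of overlaying a spatially grounded sensor mask onto an RGB frame so the result stays close to RGB statistics. The summary does not specify the exact mask construction; the function name, channel encoding, and blending weight below are illustrative assumptions, not the paper's method.

```python
import numpy as np

def sensor_masked_image(rgb: np.ndarray, sensor_mask: np.ndarray,
                        alpha: float = 0.5) -> np.ndarray:
    """Overlay a spatially grounded sensor mask onto an RGB frame (sketch).

    rgb:         (H, W, 3) uint8 camera image.
    sensor_mask: (H, W) float in [0, 1], e.g. radar returns or infrared
                 intensity already projected into the camera's pixel frame.
    Returns an (H, W, 3) uint8 image; pixels without sensor response are
    left untouched, keeping the input close to RGB statistics so an
    RGB-pretrained VLA backbone can consume it with little adaptation.
    """
    # Hypothetical encoding: put the sensor signal in the red channel.
    heat = np.zeros_like(rgb, dtype=np.float32)
    heat[..., 0] = sensor_mask * 255.0

    # Alpha-blend only where the sensor responds.
    w = alpha * sensor_mask[..., None]
    blended = (1.0 - w) * rgb.astype(np.float32) + w * heat
    return blended.astype(np.uint8)
```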
📝 Abstract
Vision-language-action (VLA) models have shown strong generalization for action prediction through large-scale vision-language pretraining. However, most existing models rely solely on RGB cameras, limiting their perception and, consequently, their manipulation capabilities. We present OmniVLA, an omni-modality VLA model that integrates novel sensing modalities for physically grounded spatial intelligence beyond RGB perception. The core of our approach is the sensor-masked image, a unified representation that overlays spatially grounded and physically meaningful masks, derived from sensors including an infrared camera, a mmWave radar, and a microphone array, onto the RGB images. This image-native unification keeps sensor input close to RGB statistics to facilitate training, provides a uniform interface across sensor hardware, and enables data-efficient learning with lightweight per-sensor projectors. Built on this representation, we present a multisensory vision-language-action model architecture and train it from an RGB-pretrained VLA backbone. We evaluate OmniVLA on challenging real-world tasks where sensor-modality perception is needed to guide the manipulation. OmniVLA achieves an average task success rate of 84% and significantly outperforms RGB-only and raw-sensor-input baseline models by 59% and 28%, respectively, while showing higher learning efficiency and stronger generalization capability.
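The abstract's lightweight per-sensor projectors can be sketched as small MLPs that map each modality's features into the RGB backbone's embedding space. A minimal PyTorch sketch follows; the module structure, dimensions, and modality names are assumptions for illustration, since the paper's exact projector architecture is not given here.

```python
import torch
import torch.nn as nn

class SensorProjector(nn.Module):
    """Maps per-sensor features into the RGB backbone's embedding space (sketch)."""

    def __init__(self, in_dim: int, embed_dim: int, hidden: int = 512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, in_dim) sensor features
        # returns: (batch, tokens, embed_dim), ready to join RGB patch tokens
        return self.proj(x)

# One small projector per modality; the RGB-pretrained VLA backbone is shared.
# Feature and embedding sizes below are hypothetical.
projectors = nn.ModuleDict({
    "infrared": SensorProjector(in_dim=256, embed_dim=1024),
    "mmwave":   SensorProjector(in_dim=128, embed_dim=1024),
    "audio":    SensorProjector(in_dim=64,  embed_dim=1024),
})
```

Keeping the projectors this small is consistent with the data-efficiency claim: only a few million parameters per modality need to be trained, while the pretrained backbone is reused.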