🤖 AI Summary
Current vision-language models (VLMs) struggle to grasp the physical semantics of multi-vision sensor imagery, such as thermal, depth, and X-ray images, achieving only superficial cross-modal alignment while lacking sensor-aware perception and reasoning. To address this, the authors introduce MS-PR (Multi-vision Sensor Perception and Reasoning), the first benchmark dedicated to sensor-specific reasoning over multi-vision sensor data. They further propose Diverse Negative Attributes (DNA) optimization, a framework that explicitly models the physical semantic disparities across imaging modalities via sensor-aware prompting, cross-modal negative sample mining, and contrastive learning. Experiments demonstrate that DNA significantly improves VLMs' accuracy on MS-PR, helping to bridge the semantic gap between visual representations and the underlying sensor physics. This work points toward more reliable embodied intelligence and real-world multimodal visual understanding in domains such as healthcare and industrial inspection.
📝 Abstract
Large-scale Vision-Language Models (VLMs) have advanced by aligning vision inputs with text, significantly improving performance on computer vision tasks. Moreover, for VLMs to be effectively utilized in real-world applications, an understanding of diverse multi-vision sensor data, such as thermal, depth, and X-ray information, is essential. However, we find that current VLMs process multi-vision sensor images without a deep understanding of sensor information, disregarding each sensor's unique physical properties. This limitation restricts their capacity to interpret and respond to complex questions that require multi-vision sensor reasoning. To address this, we propose a novel Multi-vision Sensor Perception and Reasoning (MS-PR) benchmark, assessing VLMs on their capacity for sensor-specific reasoning. Moreover, we introduce Diverse Negative Attributes (DNA) optimization to enable VLMs to perform deep reasoning on multi-vision sensor tasks, helping to bridge the core information gap between images and sensor data. Extensive experimental results validate that the proposed DNA method significantly improves multi-vision sensor reasoning in VLMs.
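To make the negative-attribute idea concrete: a minimal, hypothetical sketch of an InfoNCE-style contrastive objective, where an image embedding is pulled toward a sensor-aware positive description and pushed away from a set of negative attribute descriptions. This is an illustration of the general contrastive-learning principle only, not the paper's actual DNA loss; the function names, embeddings, and temperature value here are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(image_emb, pos_emb, neg_embs, temperature=0.07):
    """InfoNCE-style loss (illustrative, not the paper's exact DNA objective):
    the positive, sensor-aware caption embedding should score higher against
    the image than any of the diverse negative-attribute embeddings."""
    sims = [cosine(image_emb, pos_emb)] + [cosine(image_emb, n) for n in neg_embs]
    logits = [s / temperature for s in sims]
    # Numerically stable log-softmax over the positive logit (index 0).
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    return -math.log(exps[0] / sum(exps))
```

With toy 2-D embeddings, the loss is near zero when the image aligns with the positive caption and grows large when it instead aligns with a negative attribute; during training, minimizing such a loss is what drives the model to distinguish sensor-specific semantics rather than relying on superficial visual similarity.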