Are Vision-Language Models Truly Understanding Multi-vision Sensor?

📅 2024-12-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language models (VLMs) struggle to comprehend the physical semantics of multi-vision sensor imagery, such as thermal, depth, and X-ray images, achieving only superficial cross-modal alignment while lacking sensor-aware perception and reasoning. To address this, we introduce Multi-vision Sensor Perception and Reasoning (MS-PR), the first benchmark dedicated to sensor-specific physical reasoning. We further propose Diverse Negative Attributes (DNA) optimization, a framework that explicitly models physical semantic disparities across imaging modalities via sensor-aware prompting, cross-modal negative sample mining, and contrastive learning. Experiments demonstrate that DNA significantly improves VLM accuracy on MS-PR, effectively bridging the semantic gap between visual representations and the underlying sensor physics. This work supports embodied intelligence and real-world multimodal visual understanding in domains including healthcare and industrial inspection.

📝 Abstract
Large-scale Vision-Language Models (VLMs) have advanced by aligning vision inputs with text, significantly improving performance in computer vision tasks. For VLMs to be effective in real-world applications, an understanding of diverse multi-vision sensor data, such as thermal, depth, and X-ray information, is essential. However, we find that current VLMs process multi-vision sensor images without a deep understanding of sensor information, disregarding each sensor's unique physical properties. This limitation restricts their capacity to interpret and respond to complex questions requiring multi-vision sensor reasoning. To address this, we propose the novel Multi-vision Sensor Perception and Reasoning (MS-PR) benchmark, which assesses VLMs on their capacity for sensor-specific reasoning. We further introduce Diverse Negative Attributes (DNA) optimization, which enables VLMs to perform deep reasoning on multi-vision sensor tasks and helps bridge the core information gap between images and sensor data. Extensive experimental results validate that the proposed DNA method significantly improves multi-vision sensor reasoning in VLMs.
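💻 Method Sketch
The abstract describes DNA optimization as training VLMs to contrast sensor-grounded answers against diverse negative attributes. Below is a minimal, illustrative sketch of how such an objective could look, assuming an InfoNCE-style contrastive loss over one positive answer and several negative attribute statements per sensor image. The `vlm.score` method, the prompt strings, and the loss form are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of a DNA-style objective (illustrative, not the authors' code).
# Assumes a VLM exposing a hypothetical `score(image, text)` method that
# returns a scalar log-likelihood-style tensor for a (sensor image, text) pair.
import torch
import torch.nn.functional as F

def dna_style_loss(vlm, image, positive: str, negatives: list[str]) -> torch.Tensor:
    """Push the sensor-grounded positive answer above diverse negatives."""
    pos = vlm.score(image, positive)                              # shape: ()
    negs = torch.stack([vlm.score(image, n) for n in negatives])  # shape: (N,)
    logits = torch.cat([pos.unsqueeze(0), negs]).unsqueeze(0)     # shape: (1, N+1)
    target = torch.zeros(1, dtype=torch.long)                     # positive is class 0
    # InfoNCE-style cross-entropy: maximizes the positive's relative score.
    return F.cross_entropy(logits, target)

# Usage sketch for a thermal image: negatives assert physically implausible
# attributes, so the model must attend to sensor physics, not RGB priors.
# loss = dna_style_loss(model, thermal_img,
#                       positive="The bright region indicates a heat source.",
#                       negatives=["The bright region indicates blue paint.",
#                                  "This image shows visible-light color."])
```

The key design idea this sketch tries to capture is that the negatives are diverse and modality-specific: each one names an attribute that would only be true if the model ignored the sensor's physical semantics.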
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models
Diverse Multi-vision Sensor Data
Multimodal Reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-vision Sensor Perception and Reasoning (MS-PR)
Diverse Negative Attributes (DNA) Optimization
Vision-Language Model (VLM) Enhancement
Sangyun Chung
Integrated Vision Language Lab, KAIST, South Korea
Youngjoon Yu
Integrated Vision Language Lab, KAIST, South Korea
Youngchae Chee
Integrated Vision Language Lab, KAIST, South Korea
Se Yeon Kim
Integrated Vision Language Lab, KAIST, South Korea
Byung-Kwan Lee
NVIDIA
Yong Man Ro
Integrated Vision Language Lab, KAIST, South Korea
Computer Vision · Machine Learning · Vision Language Model