Allegory of the Cave: Measurement-Grounded Vision-Language Learning

📅 2026-05-12
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses a limitation of conventional vision-language models (VLMs): they reason over post-processed RGB images, and rendering can discard sensor information before inference, which degrades visual grounding. The authors propose PRISM-VL, a framework for vision-language learning in the measurement domain. It moves the input representation closer to raw sensor measurements and combines three components: RAW-derived Meas.-XYZ inputs, camera-conditioned grounding, and Exposure-Bracketed Supervision Aggregation, which transfers supervision from RGB proxies to measurement-domain observations. On a held-out benchmark targeting low-light, HDR, visibility-sensitive, and hallucination-sensitive cases, PRISM-VL-8B reaches 0.6120 BLEU, 0.4571 ROUGE-L, and 82.66% LLM-Judge accuracy, improving over the RGB Qwen3-VL-8B baseline by +0.1074 BLEU, +0.1071 ROUGE-L, and +4.46 percentage points. The results suggest that preserving raw measurement evidence improves VLM grounding.
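To make the information-loss argument concrete, here is a minimal toy sketch of how rendering can erase differences that exist in the sensor data. It is not the paper's pipeline: the function `render_to_8bit`, the gain of 4.0, the gamma of 2.2, and the sample values are all hypothetical choices made for illustration.

```python
def render_to_8bit(linear, gain=4.0, gamma=2.2):
    """Toy stand-in for an ISP (hypothetical, not the paper's method):
    apply exposure gain, clip to [0, 1], gamma-encode, quantize to 8 bits."""
    exposed = min(max(linear * gain, 0.0), 1.0)  # clipping step
    encoded = exposed ** (1.0 / gamma)           # display gamma
    return round(encoded * 255)                  # 8-bit quantization

# Three distinct linear highlight measurements (made-up values).
highlights = [0.30, 0.60, 0.90]
# Two distinct, very dark linear measurements (made-up values).
shadows = [1.0e-6, 1.1e-6]

rendered_highlights = [render_to_8bit(v) for v in highlights]
rendered_shadows = [render_to_8bit(v) for v in shadows]

# After gain, every highlight exceeds 1.0 and is clipped to the same code.
print(rendered_highlights)
# The two shadow values differ by 10% but fall into the same 8-bit bin.
print(rendered_shadows)
```

In this toy setting, the rendered image cannot distinguish between inputs that the linear measurements clearly separate, which is the kind of evidence a measurement-domain input could in principle retain.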
📝 Abstract
Vision-language models typically reason over post-ISP RGB images, although RGB rendering can clip, suppress, or quantize sensor evidence before inference. We study whether grounding improves when the visual interface is moved closer to the underlying camera measurement. We formulate measurement-grounded vision-language learning and instantiate it as PRISM-VL, which combines RAW-derived Meas.-XYZ inputs, camera-conditioned grounding, and Exposure-Bracketed Supervision Aggregation for transferring supervision from RGB proxies to measurement-domain observations. Using a quality-controlled 150K instruction-tuning set and a held-out benchmark targeting low-light, HDR, visibility-sensitive, and hallucination-sensitive cases, PRISM-VL-8B reaches 0.6120 BLEU, 0.4571 ROUGE-L, and 82.66% LLM-Judge accuracy, improving over the RGB Qwen3-VL-8B baseline by +0.1074 BLEU, +0.1071 ROUGE-L, and +4.46 percentage points. These results suggest that part of VLM grounding error arises from information lost during RGB rendering, and that preserving measurement-domain evidence can improve multimodal reasoning.
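The abstract gives PRISM-VL-8B's scores and its gains over the RGB Qwen3-VL-8B baseline but not the baseline scores themselves. The short sketch below derives the implied baseline values by subtraction. These are computed here from the reported numbers, not quoted from the paper, and the dictionary names are illustrative.

```python
# Scores reported for PRISM-VL-8B in the abstract.
reported = {"BLEU": 0.6120, "ROUGE-L": 0.4571, "LLM-Judge accuracy (%)": 82.66}
# Reported improvements over the RGB Qwen3-VL-8B baseline.
gains = {"BLEU": 0.1074, "ROUGE-L": 0.1071, "LLM-Judge accuracy (%)": 4.46}

# Implied baseline = reported score minus reported gain (derived, not quoted).
implied_baseline = {k: round(reported[k] - gains[k], 4) for k in reported}

for metric, value in implied_baseline.items():
    print(f"{metric}: {value}")
```

Under this arithmetic the implied baseline is about 0.5046 BLEU, 0.3500 ROUGE-L, and 78.20% LLM-Judge accuracy.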
Problem

Research questions and friction points this paper is trying to address.

vision-language models
RGB rendering
sensor measurements
grounding
measurement-domain
Innovation

Methods, ideas, or system contributions that make the work stand out.

measurement-grounded learning
RAW-to-XYZ
exposure-bracketed supervision
vision-language grounding
camera-conditioned reasoning