PolarVLM: Bridging the Semantic-Physical Gap in Vision-Language Models

📅 2026-05-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing vision-language models rely solely on RGB inputs and lack physical modeling of the scene, which limits their ability to perceive optically ambiguous objects such as reflective or transparent surfaces. This work proposes PolarVLM, the first vision-language model that integrates physical parameters from polarization imaging. A dual-stream multimodal architecture and a two-stage progressive training strategy let PolarVLM gain physical awareness while retaining general visual understanding. The study also contributes PolarVQA, the first polarization-aware visual question answering benchmark, featuring 75K physically grounded instruction-tuning samples. Across five evaluation tasks, PolarVLM outperforms the RGB-only baseline by 25.4% overall, including gains of 26.6% in reflection recognition and 34.0% in glass counting.

📝 Abstract

Mainstream vision-language models (VLMs) fundamentally struggle with severe optical ambiguities, such as reflections and transparent objects, due to the inherent limitations of standard RGB inputs. While polarization imaging captures polarimetric physical parameters that resolve these ambiguities, existing methods are constrained by fixed-format outputs and remain isolated from open-ended reasoning. To bridge this semantic-physical gap, we introduce PolarVLM, the first multimodal framework integrating polarimetric physical parameters into VLMs. By employing a dual-stream architecture and a progressive two-stage training strategy, PolarVLM effectively prevents physical misinterpretations while preserving general visual abilities. Complementing our architecture, we construct PolarVQA, the first benchmark for polarization-aware VQA, featuring 75K physics-grounded instruction-tuning pairs targeting reflective and transparent scenes. Experiments show that PolarVLM surpasses the RGB baseline by 25.4% overall across five evaluation tasks, with remarkable gains of 26.6% in reflection recognition and 34.0% in glass counting, successfully unlocking physics-aware semantic understanding.
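The abstract does not say which polarimetric parameters PolarVLM consumes. As a rough illustration of what such parameters typically look like, the sketch below computes the linear Stokes components, the degree of linear polarization (DoLP), and the angle of linear polarization (AoLP) for one pixel measured behind polarizers at 0°, 45°, 90°, and 135°. These formulas are standard in polarization imaging; the function names and the choice of DoLP/AoLP as the quantities of interest are assumptions made for this example, not details taken from the paper.

```python
import math


def stokes_from_intensities(i0, i45, i90, i135):
    """Linear Stokes components from four polarizer-angle intensities.

    Hypothetical helper for illustration; not from the PolarVLM paper.
    """
    s0 = 0.5 * (i0 + i45 + i90 + i135)  # total intensity
    s1 = i0 - i90                       # horizontal vs. vertical component
    s2 = i45 - i135                     # +45 deg vs. -45 deg component
    return s0, s1, s2


def dolp_aolp(i0, i45, i90, i135, eps=1e-8):
    """Degree and angle of linear polarization for a single pixel.

    Returns (dolp, aolp) with aolp in radians. eps guards against
    division by zero on completely dark pixels.
    """
    s0, s1, s2 = stokes_from_intensities(i0, i45, i90, i135)
    dolp = math.hypot(s1, s2) / max(s0, eps)
    aolp = 0.5 * math.atan2(s2, s1)
    return dolp, aolp


# Fully horizontally polarized light of unit intensity.
polarized = dolp_aolp(1.0, 0.5, 0.0, 0.5)

# Unpolarized light: every polarizer passes half the intensity.
unpolarized = dolp_aolp(0.5, 0.5, 0.5, 0.5)
```

Fully polarized light yields a DoLP of 1 and unpolarized light a DoLP of 0. Specular reflections and glass surfaces tend to partially polarize light, which is why quantities like these can help disambiguate scenes that look alike in RGB.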

Problem

Research questions and friction points this paper is trying to address.

vision-language models

optical ambiguities

polarization imaging

semantic-physical gap

transparent objects

Innovation

Methods, ideas, or system contributions that make the work stand out.

polarization imaging

vision-language models

dual-stream architecture