AI Summary
Existing vision-language models rely solely on RGB inputs, limiting their ability to accurately perceive optically ambiguous objects such as reflective or transparent surfaces due to a lack of physical world modeling. This work proposes PolarVLM, the first vision-language model that integrates physical parameters from polarization imaging. By introducing a dual-stream multimodal architecture and a two-stage progressive training strategy, PolarVLM retains general visual understanding capabilities while achieving physical awareness. The study contributes the first polarization-aware visual question answering benchmark, PolarVQA, along with 75K physically grounded instruction-tuning samples. Evaluated across five tasks, PolarVLM outperforms RGB-only baselines by an average of 25.4%, with notable improvements of 26.6% in reflection recognition and 34.0% in glass counting, significantly enhancing the model's comprehension of complex physical scenes.
Abstract
Mainstream vision-language models (VLMs) fundamentally struggle with severe optical ambiguities, such as reflections and transparent objects, due to the inherent limitations of standard RGB inputs. While polarization imaging captures polarimetric physical parameters that resolve these ambiguities, existing methods are constrained by fixed-format outputs and remain isolated from open-ended reasoning. To bridge this semantic-physical gap, we introduce PolarVLM, the first multimodal framework integrating polarimetric physical parameters into VLMs. By employing a dual-stream architecture and a progressive two-stage training strategy, PolarVLM effectively prevents physical misinterpretations while preserving general visual abilities. Complementing our architecture, we construct PolarVQA, the first benchmark for polarization-aware VQA, featuring 75K physics-grounded instruction-tuning pairs targeting reflective and transparent scenes. Experiments show that PolarVLM surpasses the RGB baseline by 25.4% overall across five evaluation tasks, with remarkable gains of 26.6% in reflection recognition and 34.0% in glass counting, successfully unlocking physics-aware semantic understanding.
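The abstract does not say which polarimetric parameters PolarVLM consumes. As an illustrative sketch only, the snippet below assumes the standard linear-polarization quantities, degree of linear polarization (DoLP) and angle of linear polarization (AoLP), derived from intensities measured behind polarizers at 0, 45, 90, and 135 degrees. The function names are hypothetical and are not taken from the paper.

```python
import math

def stokes_from_intensities(i0, i45, i90, i135):
    """Linear Stokes parameters from four polarizer-angle intensities.

    Assumption: a 0/45/90/135 degree capture setup, which the abstract
    does not confirm.
    """
    s0 = 0.5 * (i0 + i45 + i90 + i135)  # total intensity
    s1 = i0 - i90                       # horizontal vs. vertical component
    s2 = i45 - i135                     # +45 vs. -45 degree component
    return s0, s1, s2

def dolp_aolp(s0, s1, s2, eps=1e-8):
    """Degree and angle (radians) of linear polarization for one pixel."""
    dolp = math.hypot(s1, s2) / (s0 + eps)  # eps guards against division by zero
    aolp = 0.5 * math.atan2(s2, s1)
    return dolp, aolp

# Fully horizontally polarized light of unit intensity.
s0, s1, s2 = stokes_from_intensities(1.0, 0.5, 0.0, 0.5)
dolp, aolp = dolp_aolp(s0, s1, s2)
```

Under these assumptions, strongly polarized light such as specular reflection from glass gives a DoLP near 1, while unpolarized light gives a DoLP near 0. That is the kind of cue plain RGB intensities cannot provide.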