🤖 AI Summary
Existing vision-language models rely on stitched multi-view pinhole images, which struggle to capture the holistic spatial and contextual relationships inherent in panoramic scenes, leading to limited performance in complex omnidirectional scenarios such as occlusions or traffic accidents. This work proposes Panorama-Language Modeling (PLM), a novel paradigm that, for the first time, enables end-to-end 360-degree vision-language reasoning, along with PanoVQA, a large-scale panoramic visual question answering dataset. By introducing a plug-and-play panoramic sparse attention module, PLM directly processes equirectangular panoramic images and can extend existing pinhole-based vision-language models without retraining. Experiments demonstrate that PLM significantly outperforms multi-view composition approaches in challenging omnidirectional settings, validating the principle that holistic panoramic modeling yields understanding greater than the sum of its parts.
📝 Abstract
Existing vision-language models (VLMs) are tailored for pinhole imagery, stitching multiple narrow field-of-view inputs to piece together a complete omni-scene understanding. Yet such multi-view perception overlooks the holistic spatial and contextual relationships that a single panorama inherently preserves. In this work, we introduce the Panorama-Language Modeling (PLM) paradigm, a unified $360^\circ$ vision-language reasoning framework that is more than the sum of its pinhole counterparts. In addition, we present PanoVQA, a large-scale panoramic VQA dataset covering adverse omni-scenes, enabling comprehensive reasoning under object occlusions and driving accidents. To establish a foundation for PLM, we develop a plug-and-play panoramic sparse attention module that allows existing pinhole-based VLMs to process equirectangular panoramas without retraining. Extensive experiments demonstrate that our PLM achieves superior robustness and holistic reasoning under challenging omni-scenes, yielding understanding greater than the sum of its narrow parts. Project page: https://github.com/InSAI-Lab/PanoVQA.