Co-VisiON: Co-Visibility ReasONing on Sparse Image Sets of Indoor Scenes

📅 2025-06-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses co-visibility reasoning, the high-order spatial inference of identifying overlapping visible regions across views, under sparse indoor image collections. To this end, the authors introduce Co-VisiON, the first dedicated benchmark for this task, comprising over 1,000 real-world indoor scenes. They propose Covis, a human-cognition-inspired, purely visual baseline that integrates multi-view geometric modeling, cross-image attention-based feature aggregation, and sparse-view spatial consistency learning, elevating co-visibility from low-level feature matching to a structured spatial reasoning task. Experiments show that Covis achieves state-of-the-art performance among purely visual models and substantially narrows the gap to specialized vision-language models. Nevertheless, all current models fall well short of human performance, exposing a fundamental bottleneck in high-order spatial reasoning under sparse visual conditions.
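One ingredient the summary attributes to Covis is cross-image attention-based feature aggregation, where patch features from one view attend to patches from the other views. The sketch below is a minimal, hypothetical illustration of that idea in NumPy; the function name, shapes, and single-head dot-product formulation are assumptions for clarity, not the paper's actual implementation.

```python
import numpy as np

def cross_image_attention(features):
    """Illustrative cross-view attention (an assumed formulation, not Covis itself).

    features: array of shape (num_views, num_patches, dim).
    Each view's patches attend over the patches of all OTHER views,
    so every view's representation is rebuilt from cross-view context.
    """
    V, P, D = features.shape
    out = np.empty_like(features)
    for i in range(V):
        # Keys/values: patches from every other view, stacked together.
        others = np.concatenate([features[j] for j in range(V) if j != i], axis=0)
        scores = features[i] @ others.T / np.sqrt(D)        # (P, (V-1)*P)
        # Row-wise softmax over cross-view patches.
        weights = np.exp(scores - scores.max(axis=1, keepdims=True))
        weights /= weights.sum(axis=1, keepdims=True)
        out[i] = weights @ others                           # attended context
    return out

rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 4, 8))   # 3 views, 4 patches each, 8-dim features
agg = cross_image_attention(feats)
print(agg.shape)                     # (3, 4, 8)
```

Restricting each view's attention to the other views (rather than including itself) is one simple way to force the aggregation to encode cross-view evidence, which is the kind of signal co-visibility prediction needs.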

📝 Abstract
Humans exhibit a remarkable ability to recognize co-visibility, the overlapping regions visible in multiple images, even when these images are sparsely distributed across a complex scene. This capability is foundational in 3D vision and robotic perception. Despite significant progress in vision learning, it remains unclear whether current vision models have reached human-level proficiency in co-visibility analysis. In this work, we introduce the Co-Visibility ReasONing (Co-VisiON) benchmark, designed to directly evaluate co-visibility reasoning on sparse image sets across over 1,000 indoor scenarios. Our experiments reveal that while co-visibility is typically treated as a low-level feature matching task, it poses a significant challenge for existing vision models under sparse conditions. Notably, a proprietary vision-language model outperforms all purely vision-based approaches, with all models lagging substantially behind human performance. This gap underscores the need for more than basic pairwise vision processing; it calls for a comprehensive spatial understanding through high-level reasoning across multiple views. Inspired by human visual cognition, we propose a novel multi-view baseline, Covis, which achieves top performance among pure vision models and narrows the gap to the proprietary VLM. We hope our benchmark and findings will spur further advancements in developing vision models capable of robust, high-level reasoning in challenging, sparse environments. Our dataset and source code can be found at: https://ai4ce.github.io/CoVISION
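At its simplest, the task the abstract describes can be posed as a pairwise prediction problem: for every pair of images in a sparse set, decide whether the two views share a visible region, then score the predicted pairs against ground truth. The sketch below shows one such scoring scheme (pairwise F1); the function name and the choice of F1 are illustrative assumptions, not the benchmark's actual protocol.

```python
def pairwise_f1(pred, gt):
    """Illustrative pairwise F1 for co-visibility (an assumed metric).

    pred, gt: sets of frozenset({i, j}) image-index pairs marked co-visible.
    """
    tp = len(pred & gt)               # pairs both predicted and labeled co-visible
    if not pred or not gt or tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gt)
    return 2 * precision * recall / (precision + recall)

# Toy example: 3 images; ground truth says (0,1) and (1,2) overlap,
# while the model predicts (0,1) and (0,2).
gt = {frozenset(p) for p in [(0, 1), (1, 2)]}
pred = {frozenset(p) for p in [(0, 1), (0, 2)]}
print(pairwise_f1(pred, gt))  # → 0.5
```

Using unordered pairs (frozensets) reflects that co-visibility is symmetric: if image i overlaps image j, the reverse holds by definition.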
Problem

Research questions and friction points this paper is trying to address.

Evaluating co-visibility reasoning in sparse image sets
Assessing human-level proficiency in vision models
Improving spatial understanding in multi-view analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces Co-VisiON benchmark for co-visibility evaluation
Proposes Covis multi-view baseline inspired by human cognition
Highlights gap between models and human co-visibility reasoning
👥 Authors
Chao Chen, New York University, Brooklyn, NY 11201, USA
Nobel Dang, New York University, Brooklyn, NY 11201, USA
Juexiao Zhang, CS PhD student at New York University (Machine Learning, Computer Vision, Robotics)
Wenkai Sun, New York University, Brooklyn, NY 11201, USA
Pengfei Zheng, Huawei Technologies (Machine Learning System, System-Algorithm Co-Design, Distributed System, Data+AI)
Xuhang He, New York University, Brooklyn, NY 11201, USA
Yimeng Ye, New York University, Brooklyn, NY 11201, USA
Taarun Srinivas, New York University, Brooklyn, NY 11201, USA
Chen Feng, New York University, Brooklyn, NY 11201, USA