Towards Cross-View Point Correspondence in Vision-Language Models

πŸ“… 2025-12-04
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Current vision-language models (VLMs) exhibit limited performance on cross-view point-level correspondence (CVPC), struggling to precisely localize affordance regionsβ€”a key bottleneck for fine-grained embodied interaction. To address this, we introduce CVPC as a novel task and propose CrossPoint-Bench, a hierarchical benchmark with rigorous evaluation protocols. We further present CrossPoint-378K, the first large-scale, affordance-oriented cross-view point correspondence dataset, comprising 378K annotated correspondences across diverse object views and interaction contexts. Leveraging this dataset, we design CroPond, a model integrating perception, geometric reasoning, and explicit correspondence learning modules. Experiments demonstrate that CroPond achieves a 39.7% absolute accuracy gain over Gemini-2.5-Pro on CrossPoint-Bench, substantially narrowing the gap with human performance. This work establishes a new paradigm, benchmark, and model architecture for advancing spatial understanding and cross-view alignment in VLMs.
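
To make the task format concrete, the following is a minimal sketch of what a single CVPC question-answering pair could look like. All field names and the normalized [0, 1] coordinate convention are illustrative assumptions, not the released CrossPoint-378K schema.

```python
# A hypothetical CVPC question-answering pair: given a point marked in one
# view, the model must output the corresponding point in another view.
# Field names and coordinate conventions are assumptions for illustration.
cvpc_example = {
    "scene_id": "scene_0042",
    "source_image": "view_a.jpg",   # view in which the query point is marked
    "target_image": "view_b.jpg",   # view in which the model must localize it
    "query_point": (0.62, 0.41),    # (x, y) in the source view, normalized
    "question": "The marked point in view A lies on the kettle handle. "
                "Where is the same point in view B?",
    "answer_point": (0.35, 0.58),   # ground-truth correspondence in view B
}
```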

πŸ“ Abstract
Cross-view correspondence is a fundamental capability for spatial understanding and embodied AI. However, it remains far from realized in Vision-Language Models (VLMs), especially at the level of precise point correspondence, which is crucial for affordance interaction. We therefore propose the Cross-View Point Correspondence (CVPC) task and CrossPoint-Bench, a comprehensive benchmark with a hierarchical design inspired by the human cognitive process of "perceive", "reason", and "correspond". Our evaluation shows that state-of-the-art models (e.g., Gemini-2.5-Pro) still fall far behind humans, with a gap of over 54.65% in overall accuracy, exposing the challenge of transitioning from coarse-grained judgement to fine-grained coordinate prediction. To address this problem, we construct CrossPoint-378K, a dataset with 378K question-answering pairs across 900 scenes, focused on actionable affordance regions that better reflect real-world manipulation and interaction scenarios. Furthermore, we propose CroPond, a model trained on the CrossPoint-378K dataset. CroPond achieves state-of-the-art performance on CrossPoint-Bench, surpassing Gemini-2.5-Pro by 39.7% in accuracy, and offers a foundation for future work on cross-view correspondence. The benchmark, dataset, and model are publicly available at https://github.com/WangYipu2002/CrossPoint.
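
The benchmark reports overall accuracy on point predictions. The exact scoring rule is not spelled out here, but a common convention is to count a prediction correct when it lands inside the annotated target region. The sketch below illustrates such a metric under that assumption; the function name and mask-based scoring are hypothetical, and CrossPoint-Bench's actual protocol may use a different region definition or a distance threshold.

```python
import numpy as np

def point_in_region_accuracy(pred_points, gt_masks):
    """Fraction of predicted points that land inside the ground-truth
    affordance mask of the target view.

    pred_points: list of (x, y) pixel coordinates, one per example.
    gt_masks:    list of boolean arrays of shape (H, W); True marks
                 the annotated affordance region.
    """
    hits = 0
    for (x, y), mask in zip(pred_points, gt_masks):
        h, w = mask.shape
        xi, yi = int(round(x)), int(round(y))
        # A point outside the image bounds counts as a miss.
        if 0 <= xi < w and 0 <= yi < h and mask[yi, xi]:
            hits += 1
    return hits / max(len(pred_points), 1)

# Toy usage: a 4x4 mask whose affordance region is the top-left corner.
mask = np.zeros((4, 4), dtype=bool)
mask[0:2, 0:2] = True
print(point_in_region_accuracy([(1.0, 1.0), (3.0, 3.0)], [mask, mask]))  # 0.5
```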
Problem

Research questions and friction points this paper is trying to address.

Establishes the cross-view point correspondence task for vision-language models
Addresses the gap between coarse-grained judgement and fine-grained coordinate prediction in spatial understanding
Develops a dataset and model for interaction with actionable affordance regions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes the Cross-View Point Correspondence task and the CrossPoint-Bench benchmark
Constructs a large-scale dataset focused on actionable affordance regions
Introduces CroPond, a model trained on this dataset that achieves state-of-the-art performance
πŸ”Ž Similar Papers
No similar papers found.
👥 Authors

Yipu Wang, Institute of Automation, Chinese Academy of Sciences
Yuheng Ji, Institute of Automation, Chinese Academy of Sciences (Embodied AI, Computer Vision)
Yuyang Liu, Institute of Automation, Chinese Academy of Sciences
Enshen Zhou, Beihang University (Embodied AI, Embodied Agent, Robot Learning, Generative Model)
Ziqiang Yang, Jilin University
Yuxuan Tian, Peking University
Ziheng Qin, Institute of Automation, Chinese Academy of Sciences
Yue Liu, National University of Singapore
Huajie Tan, Peking University (Embodied AI, Foundation Models)
Cheng Chi, Columbia University, Stanford University (Robotics)
Zhiyuan Ma, Huazhong University of Science and Technology
Daniel Dajun Zeng, Institute of Automation, Chinese Academy of Sciences
Xiaolong Zheng, Institute of Automation, Chinese Academy of Sciences