VGGT-Segmentor: Geometry-Enhanced Cross-View Segmentation

📅 2026-04-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses unstable pixel-level instance correspondence between first-person (egocentric) and third-person (exocentric) views, which is caused by large differences in scale, viewpoint, and occlusion. The authors propose VGGT-Segmentor (VGGT-S), a framework that combines geometric modeling with semantic segmentation. The method has three components: geometry-aware feature alignment built on VGGT; a three-stage Union Segmentation Head that performs mask-prompt fusion, point-guided prediction, and iterative mask refinement; and a self-supervised training strategy that uses only single images and needs no cross-view paired annotations. On the Ego-Exo4D benchmark, the model reaches an average Intersection-over-Union (IoU) of 67.7% on Ego-to-Exo and 68.0% on Exo-to-Ego, significantly outperforming prior methods. Notably, the variant pretrained without any cross-view correspondence annotations surpasses most fully supervised baselines.

📝 Abstract

Instance-level object segmentation across disparate egocentric and exocentric views is a fundamental challenge in visual understanding, critical for applications in embodied AI and remote collaboration. This task is exceptionally difficult due to severe changes in scale, perspective, and occlusion, which destabilize direct pixel-level matching. While recent geometry-aware models like VGGT provide a strong foundation for feature alignment, we find they often fail at dense prediction tasks due to significant pixel-level projection drift, even when their internal object-level attention remains consistent. To bridge this gap, we introduce VGGT-Segmentor (VGGT-S), a framework that unifies robust geometric modeling with pixel-accurate semantic segmentation. VGGT-S leverages VGGT's powerful cross-view feature representation and introduces a novel Union Segmentation Head. This head operates in three stages (mask prompt fusion, point-guided prediction, and iterative mask refinement), effectively translating high-level feature alignment into a precise segmentation mask. Furthermore, we propose a single-image self-supervised training strategy that eliminates the need for paired annotations and enables strong generalization. On the Ego-Exo4D benchmark, VGGT-S sets a new state of the art, achieving 67.7% and 68.0% average IoU for the Ego-to-Exo and Exo-to-Ego tasks, respectively, significantly outperforming prior methods. Notably, our correspondence-free pretrained model surpasses most fully supervised baselines, demonstrating the effectiveness and scalability of our approach.
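The reported numbers are average IoU scores. As a rough illustration of the metric only, the sketch below computes IoU between a predicted and a ground-truth binary mask and averages it over a set of pairs. The function names `mask_iou` and `mean_iou`, the nested-list mask representation, and the convention that two empty masks score 1.0 are assumptions made here for illustration; the abstract does not specify the benchmark's exact evaluation protocol.

```python
from typing import Iterable, Sequence, Tuple

# Binary mask as rows of 0/1 values (hypothetical representation for this sketch).
Mask = Sequence[Sequence[int]]


def mask_iou(pred: Mask, target: Mask) -> float:
    """Intersection-over-Union between two binary masks of identical shape."""
    same_rows = len(pred) == len(target)
    if not same_rows or any(len(p) != len(t) for p, t in zip(pred, target)):
        raise ValueError("masks must have the same shape")

    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1  # pixel is foreground in both masks
            if p or t:
                union += 1  # pixel is foreground in at least one mask

    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union


def mean_iou(pairs: Iterable[Tuple[Mask, Mask]]) -> float:
    """Unweighted average of per-pair IoU scores."""
    scores = [mask_iou(pred, target) for pred, target in pairs]
    if not scores:
        raise ValueError("at least one mask pair is required")
    return sum(scores) / len(scores)


if __name__ == "__main__":
    pred = [[1, 1], [0, 0]]
    target = [[1, 0], [1, 0]]
    print(mask_iou(pred, target))  # 1 shared pixel over 3 foreground pixels
    print(mean_iou([(pred, target), (target, target)]))
```

Averaging per pair, as done here, weights every instance equally regardless of object size; a benchmark may instead aggregate per frame or per sequence.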

Problem

Research questions and friction points this paper is trying to address.

cross-view segmentation

egocentric-exocentric alignment

instance-level segmentation

pixel-level projection drift

dense prediction

Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-view segmentation

geometry-aware modeling

self-supervised training