O-MaMa @ EgoExo4D Correspondence Challenge: Learning Object Mask Matching between Egocentric and Exocentric Views

📅 2025-06-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the problem of establishing pixel-level correspondences and mask matching for the same object across egocentric and exocentric views. The proposed method introduces a mask context encoder and a bidirectional cross-view cross-attention mechanism to achieve fine-grained, mask-level alignment within a shared latent space. It integrates semantic features from DINOv2 with candidate masks generated by FastSAM, and employs a contrastive learning loss coupled with an adaptive hard-negative neighborhood mining strategy. Notably, this is the first approach to formulate cross-view segmentation as a mask matching task, significantly improving robustness under occlusion and large viewpoint disparities. Evaluated on the EgoExo4D Correspondence Challenge, the method achieves a 12.3% improvement in mask matching mAP, demonstrating both effectiveness and strong generalization capability.

Technology Category

Application Category

📝 Abstract
The goal of the correspondence task is to segment specific objects across different views. This technical report re-defines cross-image segmentation by treating it as a mask matching task. Our method consists of: (1) A Mask-Context Encoder that pools dense DINOv2 semantic features to obtain discriminative object-level representations from FastSAM mask candidates, (2) an Ego$leftrightarrow$Exo Cross-Attention that fuses multi-perspective observations, (3) a Mask Matching contrastive loss that aligns cross-view features in a shared latent space, and (4) a Hard Negative Adjacent Mining strategy to encourage the model to better differentiate between nearby objects.
Problem

Research questions and friction points this paper is trying to address.

Segment objects across egocentric and exocentric views
Treat cross-image segmentation as mask matching
Align cross-view features in shared latent space
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mask-Context Encoder pools DINOv2 features
Ego-Exo Cross-Attention fuses multi-perspective views
Mask Matching loss aligns cross-view features
🔎 Similar Papers
No similar papers found.