🤖 AI Summary
This paper addresses the cross-view object matching problem between egocentric (first-person) and exocentric (third-person) perspectives. To tackle this challenge, we propose a dense object correspondence modeling framework. Our method introduces a joint encoder that densely matches visual, spatial, and semantic features, simultaneously capturing multi-object geometric relationships and semantic consistency across views. Additionally, we design a mask optimization head that jointly learns proposal generation and mask refinement in an end-to-end manner. Evaluated on the Ego-Exo4D benchmark, our approach achieves 49.7% mIoU on Ego→Exo matching and 55.2% on Exo→Ego, absolute improvements of 5.8% and 4.3% over prior state-of-the-art methods. To the best of our knowledge, this is the first work to achieve pixel-level, object-level dense correspondence across egocentric and exocentric views together with high-fidelity segmentation.
📝 Abstract
Cross-view object correspondence involves matching objects between egocentric (first-person) and exocentric (third-person) views. It is a critical yet challenging task for visual understanding. In this work, we propose the Dense Object Matching and Refinement (DOMR) framework to establish dense object correspondences across views. The framework centers on the Dense Object Matcher (DOM) module, which jointly models multiple objects. Unlike methods that directly match individual object masks to image features, DOM leverages both positional and semantic relationships among objects to find correspondences. DOM integrates a proposal generation module with a dense matching module that jointly encodes visual, spatial, and semantic cues, explicitly constructing inter-object relationships to achieve dense matching among objects. Furthermore, we combine DOM with a mask refinement head designed to improve the completeness and accuracy of the predicted masks, forming the complete DOMR framework. Extensive evaluations on the Ego-Exo4D benchmark demonstrate that our approach achieves state-of-the-art performance with a mean IoU of 49.7% on Ego→Exo and 55.2% on Exo→Ego, outperforming previous methods by 5.8% and 4.3%, respectively, and validating the effectiveness of our integrated approach for cross-view understanding.
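To make the dense matching idea concrete, the sketch below illustrates one plausible reading of the DOM stage: each object is described by concatenated visual, spatial, and semantic cues, and every ego object is scored against every exo proposal via a similarity matrix. This is a minimal NumPy illustration, not the paper's implementation; the function names (`joint_embed`, `dense_match`), the cosine-similarity scoring, the temperature value, and the toy dimensions are all assumptions for exposition.

```python
import numpy as np

def joint_embed(visual, spatial, semantic):
    """Concatenate per-object visual, spatial, and semantic cues
    into one joint descriptor and L2-normalize it (hypothetical encoder)."""
    z = np.concatenate([visual, spatial, semantic], axis=-1)
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

def dense_match(ego, exo, temperature=0.1):
    """Score every ego object against every exo proposal and return a
    row-stochastic soft correspondence matrix plus hard assignments."""
    sim = ego @ exo.T                       # (N_ego, N_exo) cosine similarities
    logits = sim / temperature
    # Numerically stable row-wise softmax over exo proposals.
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    return probs, probs.argmax(axis=1)

# Toy example: 3 ego objects vs. 4 exo proposals with random cue vectors.
rng = np.random.default_rng(0)
ego = joint_embed(rng.normal(size=(3, 8)), rng.normal(size=(3, 4)), rng.normal(size=(3, 16)))
exo = joint_embed(rng.normal(size=(4, 8)), rng.normal(size=(4, 4)), rng.normal(size=(4, 16)))
probs, matches = dense_match(ego, exo)
```

Because all objects are embedded and compared jointly in one matrix, inter-object context (relative positions, co-occurring semantics) can inform every pairwise score, which is the intuition behind matching objects densely rather than one mask at a time.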