🤖 AI Summary
This paper addresses the cross-view object matching problem between egocentric (first-person) and exocentric (third-person) perspectives. To tackle this challenge, we propose a dense object correspondence modeling framework. Our method introduces a joint encoder that densely matches visual, spatial, and semantic features, simultaneously capturing multi-object geometric relationships and semantic consistency across views. Additionally, we design a mask optimization head that jointly learns proposal generation and mask refinement in an end-to-end manner. Evaluated on the Ego-Exo4D benchmark, our approach achieves 49.7% mIoU on Ego→Exo matching and 55.2% on Exo→Ego, absolute improvements of 5.8% and 4.3% over prior state-of-the-art methods. To the best of our knowledge, this is the first work to achieve pixel-level, object-level dense correspondence across egocentric and exocentric views together with high-fidelity segmentation.
📝 Abstract
Cross-view object correspondence involves matching objects between egocentric (first-person) and exocentric (third-person) views. It is a critical yet challenging task for visual understanding. In this work, we propose the Dense Object Matching and Refinement (DOMR) framework to establish dense object correspondences across views. The framework centers on the Dense Object Matcher (DOM) module, which jointly models multiple objects. Unlike methods that directly match individual object masks to image features, DOM leverages both positional and semantic relationships among objects to find correspondences. DOM integrates a proposal generation module with a dense matching module that jointly encodes visual, spatial, and semantic cues, explicitly constructing inter-object relationships to achieve dense matching among objects. Furthermore, we combine DOM with a mask refinement head designed to improve the completeness and accuracy of the predicted masks, forming the complete DOMR framework. Extensive evaluations on the Ego-Exo4D benchmark demonstrate that our approach achieves state-of-the-art performance with a mean IoU of 49.7% on Ego→Exo and 55.2% on Exo→Ego, outperforming previous methods by 5.8% and 4.3%, respectively, and validating the effectiveness of our integrated approach for cross-view understanding.
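To make the dense matching idea concrete, the sketch below illustrates one plausible reading of the DOM stage: each object is described by concatenated visual, spatial, and semantic cues, and every ego object is scored against every exo proposal via a similarity matrix. This is a minimal NumPy illustration, not the paper's implementation; the function names (`joint_embed`, `dense_match`), the cosine-similarity scoring, the temperature value, and the toy dimensions are all assumptions for exposition.

```python
import numpy as np

def joint_embed(visual, spatial, semantic):
    """Concatenate per-object visual, spatial, and semantic cues
    into one joint descriptor and L2-normalize it (hypothetical encoder)."""
    z = np.concatenate([visual, spatial, semantic], axis=-1)
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

def dense_match(ego, exo, temperature=0.1):
    """Score every ego object against every exo proposal and return a
    row-stochastic soft correspondence matrix plus hard assignments."""
    sim = ego @ exo.T                       # (N_ego, N_exo) cosine similarities
    logits = sim / temperature
    # Numerically stable row-wise softmax over exo proposals.
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    return probs, probs.argmax(axis=1)

# Toy example: 3 ego objects vs. 4 exo proposals with random cue vectors.
rng = np.random.default_rng(0)
ego = joint_embed(rng.normal(size=(3, 8)), rng.normal(size=(3, 4)), rng.normal(size=(3, 16)))
exo = joint_embed(rng.normal(size=(4, 8)), rng.normal(size=(4, 4)), rng.normal(size=(4, 16)))
probs, matches = dense_match(ego, exo)
```

Because all objects are embedded and compared jointly in one matrix, inter-object context (relative positions, co-occurring semantics) can inform every pairwise score, which is the intuition behind matching objects densely rather than one mask at a time.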