Object-Shot Enhanced Grounding Network for Egocentric Video

📅 2025-05-07
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing egocentric video moment localization methods neglect the dynamic coupling between object semantics and the wearer's visual attention, limiting their effectiveness on fine-grained queries. To address this, we propose a cross-modal alignment network that integrates object-aware modeling with shot-level motion trajectory modeling. Specifically, we introduce an object enhancement mechanism that explicitly captures the association between query-specified target objects and the wearer's visual attention. We also design a shot-level motion trajectory modeling module that jointly leverages multi-scale video feature extraction, detection-guided fine-grained text–video alignment, and contrastive learning for optimization. Evaluated on three standard benchmarks, our method achieves state-of-the-art performance, with average Recall@1 improvements of 3.2–5.8% over prior work. It yields particularly large gains on object-centric question-answering queries, demonstrating superior localization accuracy in semantically grounded scenarios.
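The summary mentions contrastive learning for text–video alignment. As a rough illustration only (the paper's actual loss is not given here; the function name, NumPy setup, and temperature value are assumptions), a symmetric InfoNCE objective over matched video/query feature pairs can be sketched as:

```python
import numpy as np

def info_nce(video_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE loss over matched video/text feature pairs.

    video_feats, text_feats: (N, D) arrays; row i of each is a matched pair.
    """
    # L2-normalise so dot products become cosine similarities.
    v = video_feats / np.linalg.norm(video_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = v @ t.T / temperature  # (N, N) similarity matrix

    def xent(l):
        # Cross-entropy with the diagonal (matched pairs) as positives.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_p))

    # Average the video-to-text and text-to-video directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

Minimizing this loss pulls each query embedding toward its matched video clip and pushes it away from the other clips in the batch, which is the standard mechanism behind contrastive cross-modal alignment.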

📝 Abstract
Egocentric video grounding is a crucial task for embodied intelligence applications, distinct from exocentric video moment localization. Existing methods primarily focus on the distributional differences between egocentric and exocentric videos but often neglect key characteristics of egocentric videos and the fine-grained information emphasized by question-type queries. To address these limitations, we propose OSGNet, an Object-Shot enhanced Grounding Network for egocentric video. Specifically, we extract object information from videos to enrich video representation, particularly for objects highlighted in the textual query but not directly captured in the video features. Additionally, we analyze the frequent shot movements inherent to egocentric videos, leveraging these features to extract the wearer's attention information, which enhances the model's ability to perform modality alignment. Experiments conducted on three datasets demonstrate that OSGNet achieves state-of-the-art performance, validating the effectiveness of our approach. Our code can be found at https://github.com/Yisen-Feng/OSGNet.
Problem

Research questions and friction points this paper is trying to address.

Enhancing egocentric video grounding with object-shot features
Addressing neglect of egocentric video key characteristics
Improving modality alignment via wearer attention analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extracts object info to enrich video representation
Analyzes shot movements for attention information
Leverages wearer attention to enhance modality alignment
Yisen Feng
Harbin Institute of Technology (Shenzhen)
Multimodal Analysis
Haoyu Zhang
Pengcheng Laboratory
Meng Liu
Shandong Jianzhu University
Weili Guan
Harbin Institute of Technology (Shenzhen)
Liqiang Nie
Harbin Institute of Technology (Shenzhen)