Object-Centric Framework for Video Moment Retrieval

📅 2025-12-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video moment retrieval methods rely on frame- or clip-level global features, which struggle to capture fine-grained object semantics and their dynamic interactions, leading to inaccurate localization for entity-relation queries. To address this, the paper proposes an object-centric retrieval framework: first, parsing the natural language query into a scene graph and generating frame-level scene graphs for the video; second, extracting object-level feature sequences from those graphs; and third, introducing a relational tracklet transformer that explicitly models object state evolution and cross-frame spatio-temporal relations. The approach jointly integrates scene-graph structure with tracklet-level relation modeling, overcoming the limitations of conventional global representations. Extensive experiments demonstrate state-of-the-art performance on three benchmarks (Charades-STA, QVHighlights, and TACoS), validating that fine-grained object-level modeling significantly improves localization accuracy.

📝 Abstract
Most existing video moment retrieval methods rely on temporal sequences of frame- or clip-level features that primarily encode global visual and semantic information. However, such representations often fail to capture fine-grained object semantics and appearance, which are crucial for localizing moments described by object-oriented queries involving specific entities and their interactions. In particular, temporal dynamics at the object level have been largely overlooked, limiting the effectiveness of existing approaches in scenarios requiring detailed object-level reasoning. To address this limitation, we propose a novel object-centric framework for moment retrieval. Our method first extracts query-relevant objects using a scene graph parser and then generates scene graphs from video frames to represent these objects and their relationships. Based on the scene graphs, we construct object-level feature sequences that encode rich visual and semantic information. These sequences are processed by a relational tracklet transformer, which models spatio-temporal correlations among objects over time. By explicitly capturing object-level state changes, our framework enables more accurate localization of moments aligned with object-oriented queries. We evaluated our method on three benchmarks: Charades-STA, QVHighlights, and TACoS. Experimental results demonstrate that our method outperforms existing state-of-the-art methods across all benchmarks.
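The pipeline described in the abstract (parse the query into a scene graph, build object-level feature sequences from per-frame scene graphs, then score candidate moments) can be sketched at a toy level as follows. This is a minimal illustration only: the function names (`parse_query_objects`, `build_object_sequences`, `score_moment`) and the cosine-similarity scorer are assumptions for the sketch, not the paper's actual method, which replaces the naive scorer with a relational tracklet transformer over the object sequences.

```python
from math import sqrt

def parse_query_objects(query_triplets):
    """Collect the distinct entities from a parsed query scene graph.
    Each triplet is (subject, relation, object), e.g. ("person", "holds", "cup")."""
    objs = []
    for subj, _rel, obj in query_triplets:
        for name in (subj, obj):
            if name not in objs:
                objs.append(name)
    return objs

def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def build_object_sequences(frame_graphs, query_objects):
    """For each query-relevant object, gather its per-frame feature
    vectors into a sequence (a crude stand-in for a tracklet)."""
    seqs = {name: [] for name in query_objects}
    for graph in frame_graphs:            # one scene graph per frame:
        for name, feat in graph.items():  # detected object -> feature vector
            if name in seqs:
                seqs[name].append(feat)
    return seqs

def score_moment(seqs, query_feats, start, end):
    """Score a candidate moment [start, end) by the average similarity
    between each object's frame features and its query-side feature."""
    scores = []
    for name, frames in seqs.items():
        for feat in frames[start:end]:
            scores.append(cosine(feat, query_feats[name]))
    return sum(scores) / len(scores) if scores else 0.0
```

In the actual framework, the per-object sequences would be fed jointly into the relational tracklet transformer so that cross-object, cross-frame correlations (not just per-object similarity) drive the moment score.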
Problem

Research questions and friction points this paper is trying to address.

Capturing fine-grained object semantics for moment retrieval
Modeling object-level temporal dynamics and interactions
Improving accuracy in localizing moments described by object-oriented queries
Innovation

Methods, ideas, or system contributions that make the work stand out.

Object-centric framework using scene graphs
Relational tracklet transformer for spatio-temporal correlations
Object-level feature sequences for fine-grained semantics
Zongyao Li
Visual Intelligence Research Laboratories, NEC Corporation
Yongkang Wong
National University of Singapore
Satoshi Yamazaki
Visual Intelligence Research Laboratories, NEC Corporation
Jianquan Liu
Director | Senior Principal Researcher, Visual Intelligence Research Laboratories, NEC Corporation
Database · Multimedia · Data Mining · Information Retrieval
Mohan Kankanhalli
National University of Singapore