ToG-Bench: Task-Oriented Spatio-Temporal Grounding in Egocentric Videos

📅 2025-12-03
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing STVG methods focus primarily on descriptive object localization, limiting their applicability to task-oriented, embodied interaction. This paper addresses task-intent-driven spatio-temporal object localization in first-person videos and introduces ToG-Bench, the first task-oriented spatio-temporal video grounding benchmark. Its core contributions are: (1) formalizing a task-oriented localization paradigm grounded in *intent* rather than visual appearance; (2) proposing explicit-implicit dual grounding and one-to-many object association to overcome traditional object-centric annotation constraints; and (3) constructing 100 videos with 2,704 task instructions derived from ScanNet, annotated via a semi-automatic pipeline combining LLM assistance and human refinement, alongside novel evaluation metrics supporting multi-object grounding and implicit reasoning. A comprehensive evaluation across seven state-of-the-art multimodal large language models reveals significant bottlenecks in task-intent comprehension and implicit relational modeling.

πŸ“ Abstract
A core capability towards general embodied intelligence lies in localizing task-relevant objects from an egocentric perspective, formulated as Spatio-Temporal Video Grounding (STVG). Despite recent progress, existing STVG studies remain largely confined to object-centric and descriptive instructions, neglecting the task-oriented reasoning that is crucial for embodied agents to accomplish goal-directed interactions. To bridge this gap, we introduce ToG-Bench, the first task-oriented spatio-temporal video grounding benchmark for egocentric videos. ToG-Bench is characterized by three key features: (1) Task-oriented Grounding, which requires identifying and localizing objects based on intended tasks rather than straightforward descriptions; (2) Explicit-Implicit Dual Grounding, where target objects can be either explicitly mentioned or implicitly inferred by contextual reasoning; (3) One-to-Many Grounding, where a single instruction may correspond to multiple objects involved in task execution. Built upon videos sourced from ScanNet, ToG-Bench comprises 100 annotated clips with 2,704 task-oriented grounding instructions, constructed via a semi-automated pipeline that combines foundation model annotation and human refinement. In addition, we introduce a set of task-level evaluation metrics tailored for multi-object and explicit-implicit object grounding, and systematically benchmark seven state-of-the-art MLLMs. Extensive experiments reveal the intrinsic challenges of task-oriented STVG and substantial performance gaps across explicit-implicit and multi-object grounding, highlighting the difficulty of bridging perception and interaction in embodied scenarios. Data and code will be released at: https://github.com/qaxuDev/ToG-Bench.
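The abstract describes grounding objects in both space (bounding boxes) and time (video spans), with task-level metrics for one-to-many grounding. The paper's actual metric definitions are not given on this page; the following is a generic sketch, in the style of spatio-temporal IoU scores common in the STVG literature, of how a recall-style one-to-many grounding score could be computed. All function names and the scoring scheme here are illustrative assumptions, not the benchmark's official evaluation code.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def temporal_iou(a, b):
    """IoU of two time spans given as (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def one_to_many_score(gt_objects, predictions):
    """Recall-style score for one-to-many grounding (illustrative):
    each ground-truth object is credited with the best spatio-temporal
    IoU any prediction achieves on it, then scores are averaged.
    Each object/prediction is a dict with 'box' and 'span' keys."""
    if not gt_objects:
        return 0.0
    total = 0.0
    for gt in gt_objects:
        best = 0.0
        for pred in predictions:
            st_iou = box_iou(gt["box"], pred["box"]) * temporal_iou(gt["span"], pred["span"])
            best = max(best, st_iou)
        total += best
    return total / len(gt_objects)
```

Averaging over ground-truth objects (rather than predictions) rewards models that recover every object an instruction implies, which matches the one-to-many setting the benchmark emphasizes.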
Problem

Research questions and friction points this paper is trying to address.

Develops a benchmark for task-oriented object localization in egocentric videos
Addresses explicit and implicit object grounding through contextual reasoning
Evaluates multi-object grounding where one instruction corresponds to multiple objects
Innovation

Methods, ideas, or system contributions that make the work stand out.

Task-oriented grounding based on intended tasks
Explicit-implicit dual grounding via contextual reasoning
One-to-many grounding for multiple objects per instruction
Qi'ao Xu
East China Normal University
Tianwen Qian
East China Normal University
Multimedia · Vision and Language · Embodied AI
Yuqian Fu
INSAIT, Sofia University "St. Kliment Ohridski"
Kailing Li
School of Computer Science and Technology, East China Normal University
Yang Jiao
Fudan University
Jiacheng Zhang
Fudan University
Xiaoling Wang
School of Computer Science and Technology, East China Normal University
Liang He
School of Computer Science and Technology, East China Normal University