🤖 AI Summary
Existing STVG methods focus primarily on descriptive object localization, limiting their applicability to task-oriented, embodied interaction. This paper addresses task-intent-driven spatio-temporal object localization in first-person videos and introduces ToG-Bench, the first task-oriented spatio-temporal video grounding benchmark. Its core contributions are: (1) formalizing a task-oriented localization paradigm grounded in *intent* rather than visual appearance; (2) proposing explicit-implicit dual grounding and one-to-many object association to overcome traditional object-centric annotation constraints; and (3) constructing 100 videos and 2,704 task instructions derived from ScanNet, annotated via a semi-automatic pipeline combining LLM assistance and human refinement, alongside novel evaluation metrics supporting multi-object grounding and implicit reasoning. A comprehensive evaluation of seven state-of-the-art multimodal large language models reveals significant bottlenecks in task-intent comprehension and implicit relational modeling.
📄 Abstract
A core capability for general embodied intelligence lies in localizing task-relevant objects from an egocentric perspective, formulated as Spatio-Temporal Video Grounding (STVG). Despite recent progress, existing STVG studies remain largely confined to object-centric, descriptive instructions, neglecting the task-oriented reasoning that is crucial for embodied agents to accomplish goal-directed interactions. To bridge this gap, we introduce **ToG-Bench**, the first task-oriented spatio-temporal video grounding benchmark for egocentric videos. ToG-Bench is characterized by three key features: (1) **Task-oriented Grounding**, which requires identifying and localizing objects based on intended tasks rather than straightforward descriptions; (2) **Explicit-Implicit Dual Grounding**, where target objects may be either explicitly mentioned or implicitly inferred through contextual reasoning; (3) **One-to-Many Grounding**, where a single instruction may correspond to multiple objects involved in task execution. Built upon videos sourced from ScanNet, ToG-Bench comprises 100 annotated clips with 2,704 task-oriented grounding instructions, constructed via a semi-automated pipeline that combines foundation-model annotation and human refinement. In addition, we introduce a set of task-level evaluation metrics tailored for multi-object and explicit-implicit grounding, and systematically benchmark seven state-of-the-art MLLMs. Extensive experiments reveal the intrinsic challenges of task-oriented STVG and substantial performance gaps across explicit-implicit and multi-object grounding, highlighting the difficulty of bridging perception and interaction in embodied scenarios. Data and code will be released at: https://github.com/qaxuDev/ToG-Bench.
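To make the one-to-many evaluation setting concrete, the sketch below shows one *possible* way a task-level metric could credit an instruction that grounds several objects: score each ground-truth object by temporal IoU against its predicted segment and report the fraction that clears a threshold. The function names, dictionary format, and 0.5 threshold are illustrative assumptions, not ToG-Bench's actual protocol.

```python
# Hedged sketch of a task-level, one-to-many grounding metric.
# All names, data shapes, and the 0.5 threshold are assumptions for
# illustration; they are not the benchmark's official definitions.

def temporal_iou(pred, gt):
    """IoU between two (start, end) temporal segments in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def task_level_score(predictions, targets, thresh=0.5):
    """Fraction of ground-truth objects whose predicted segment reaches
    the tIoU threshold. `targets` maps object id -> (start, end); one
    instruction may list several objects (one-to-many grounding)."""
    if not targets:
        return 0.0
    hits = 0
    for obj_id, gt_span in targets.items():
        pred_span = predictions.get(obj_id)
        if pred_span is not None and temporal_iou(pred_span, gt_span) >= thresh:
            hits += 1
    return hits / len(targets)

# Example: one instruction grounding two objects; "cup" is localized
# well enough (tIoU ~0.71), "sink" is missed entirely.
preds = {"cup": (2.0, 5.0), "sink": (6.0, 9.0)}
gts = {"cup": (2.5, 5.5), "sink": (10.0, 12.0)}
print(task_level_score(preds, gts))  # -> 0.5
```

Averaging this per-instruction score over the dataset would yield a single number that still penalizes models for grounding only the explicitly mentioned object while missing implicitly required ones.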