Towards Object-centric Understanding for Instructional Videos

📅 2025-12-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing action-centric video understanding methods struggle to model dynamic, object-state-dependent step sequences in real-world tasks. This paper proposes an object-centric paradigm for video understanding, treating actions as mechanisms that drive object state evolution, and introduces an agent framework supporting multi-hop reasoning and evidence localization. Our contributions are threefold: (1) We release Object-IVQA, the first benchmark explicitly designed for object-state reasoning, evaluating four capabilities: state evolution, prerequisite verification, counterfactual reasoning, and error identification; (2) We design a modular architecture integrating object-level planning, perception, analysis, and generation, enabling explicit cross-clip evidence retrieval and multi-step reasoning; (3) We combine vision-language models with tool-augmented reasoning mechanisms. Experiments reveal substantial deficits of current large models in object-level reasoning; our framework achieves significant performance gains on Object-IVQA, advancing video understanding from action-centric toward object-centric paradigms.

📝 Abstract
Understanding procedural activities is crucial for developing future assistive AI that can reason about complex real-world tasks. Existing action-centric methods struggle with the flexibility of real procedures, where step order varies depending on object states. In this work, we propose to shift the focus to an object-centric paradigm by regarding actions as mechanisms that drive state transitions. To advance this direction, we introduce Object-IVQA, a long-form instructional video benchmark with 107 videos and 514 open-ended question-answer pairs annotated with temporally grounded evidence. The benchmark evaluates four dimensions of object-centric reasoning: state evolution, precondition verification, counterfactual reasoning, and mistake recognition. We further propose an agent framework that orchestrates object-centric planning, perception, analysis, and generation tools, enabling explicit evidence retrieval and multi-hop reasoning across disjoint segments. Experiments show that existing large vision-language models struggle with object-level recognition and reasoning, whereas our framework achieves substantial improvements.
Problem

Research questions and friction points this paper is trying to address.

Develops object-centric reasoning for instructional videos
Evaluates state evolution and counterfactual reasoning in procedures
Improves evidence retrieval and multi-hop reasoning across segments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Object-centric paradigm for instructional video understanding
Agent framework integrating planning, perception, analysis, generation
Explicit evidence retrieval and multi-hop reasoning across segments
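The plan → perceive → analyze → generate loop described above can be sketched as a toy pipeline. This is a minimal illustration under my own assumptions, not the authors' actual framework: all names (`Evidence`, `plan`, `perceive`, `analyze`, `generate`, `answer`) and the keyword-matching planner are hypothetical stand-ins for the paper's tool-augmented agent.

```python
from dataclasses import dataclass

# Hypothetical sketch of an object-centric agent loop; the data
# structures and tool signatures are illustrative, not the paper's API.

@dataclass
class Evidence:
    clip_id: str
    start: float        # clip start time in seconds
    end: float          # clip end time in seconds
    observation: str    # observed object state, e.g. "kneaded"

def plan(question, known_objects):
    """Planning tool: pick out which tracked objects the question mentions."""
    q = question.lower()
    return [obj for obj in known_objects if obj in q]

def perceive(video_index, obj):
    """Perception tool: retrieve evidence clips in which the object appears."""
    return video_index.get(obj, [])

def analyze(clips):
    """Analysis tool: order evidence temporally to trace state evolution."""
    return sorted(clips, key=lambda e: e.start)

def generate(obj, trace):
    """Generation tool: verbalize the state trajectory as an answer."""
    if not trace:
        return f"{obj}: no evidence found"
    return f"{obj}: " + " -> ".join(e.observation for e in trace)

def answer(question, video_index):
    """Multi-hop loop: plan, then perceive/analyze/generate per object."""
    return [generate(obj, analyze(perceive(video_index, obj)))
            for obj in plan(question, video_index)]
```

For example, with evidence clips indexed out of temporal order, the loop recovers the ordered state trajectory: `answer("How did the dough change?", {"dough": [Evidence("c2", 40.0, 55.0, "kneaded"), Evidence("c1", 10.0, 20.0, "mixed")]})` yields `["dough: mixed -> kneaded"]`.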