🤖 AI Summary
This work addresses the insufficient joint modeling of action steps and scene state changes in procedural activity understanding. We propose a process-aware video representation learning framework that, for the first time, leverages explicit state-change descriptions generated by large language models (LLMs) as supervisory signals—and constructs their counterfactual variants—to jointly model bidirectional causal relationships between actions and states, thereby enhancing “if–then” reasoning. Our key contributions are: (1) the first use of LLM-generated state descriptions and their counterfactual counterparts for video representation learning; and (2) unified causal modeling of both normative procedures and anomalous/erroneous steps. Extensive experiments demonstrate significant improvements over state-of-the-art methods on temporal action segmentation and procedural error detection, validating the effectiveness of explicit state supervision and counterfactual reasoning for procedural understanding.
📝 Abstract
Understanding a procedural activity requires modeling both how action steps transform the scene and how the evolving scene transformations can in turn influence the sequence of action steps, including steps that are accidental or erroneous. Yet existing work on procedure-aware video representations fails to explicitly learn these state changes (scene transformations). In this work, we study procedure-aware video representation learning by incorporating state-change descriptions generated by large language models (LLMs) as supervision signals for video encoders. Moreover, we generate state-change counterfactuals that simulate hypothesized failure outcomes, allowing the model to learn by imagining unseen "what if" scenarios. This counterfactual reasoning strengthens the model's ability to understand the cause and effect of each step in an activity. To verify the procedure awareness of our model, we conduct extensive experiments on procedure-aware tasks, including temporal action segmentation and error detection. Our results demonstrate the effectiveness of the proposed state-change descriptions and their counterfactuals, yielding significant improvements on multiple tasks.
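To make the supervision idea concrete, here is a minimal sketch of how LLM-generated state-change descriptions and their counterfactuals could serve as contrastive targets for a video encoder. This is an illustrative assumption, not the paper's exact objective: it implements a standard InfoNCE-style loss in NumPy, where the clip embedding is pulled toward the embedding of its true state-change description and pushed away from embeddings of counterfactual ("what if the step failed") descriptions. All function and variable names here are hypothetical.

```python
import numpy as np

def state_change_contrastive_loss(video_emb, pos_text_emb, cf_text_embs, tau=0.07):
    """InfoNCE-style loss (illustrative sketch, not the paper's exact objective).

    video_emb:    (D,)  video clip embedding from the video encoder
    pos_text_emb: (D,)  embedding of the LLM-generated state-change description
    cf_text_embs: (K, D) embeddings of K counterfactual (failure) descriptions
    """
    def l2norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    v = l2norm(video_emb)
    p = l2norm(pos_text_emb)
    n = l2norm(cf_text_embs)

    # Similarity logits: the true description first, then the K counterfactuals.
    logits = np.concatenate([[v @ p], n @ v]) / tau
    logits -= logits.max()  # numerical stability before softmax
    probs = np.exp(logits) / np.exp(logits).sum()

    # Cross-entropy with the true state-change description as the target class.
    return -np.log(probs[0])
```

Minimizing this loss over clips encourages the video representation to encode *which* state change actually occurred, while the counterfactual negatives force it to discriminate the observed outcome from plausible failure outcomes of the same step.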