InstrAct: Towards Action-Centric Understanding in Instructional Videos

📅 2026-04-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing video foundation models struggle to capture fine-grained actions and their temporal relationships in instructional videos because of noisy supervision and static bias, that is, reliance on objects rather than motion cues. This work proposes InstrAct, a pretraining framework for action-centric representation learning in instructional video understanding. InstrAct filters noisy captions with a data-driven strategy, generates action-centric hard negatives for contrastive learning, and uses an Action Perceiver to extract motion-relevant tokens from redundant video encodings. It adds two auxiliary objectives: Dynamic Time Warping alignment (DTW-Align), which models sequential temporal structure, and Masked Action Modeling (MAM), which strengthens cross-modal grounding. On the newly curated InstrAct Bench, the method consistently outperforms state-of-the-art models on semantic reasoning, procedural logic, and fine-grained retrieval tasks.
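To make the role of action-centric hard negatives concrete, here is a minimal sketch of an InfoNCE-style contrastive loss in which each video is scored against its own caption and a set of hard-negative captions. It is not the paper's implementation: the function name `info_nce_with_hard_negatives`, the temperature value, and the toy two-dimensional embeddings are all assumptions made for illustration. A plausible hard negative keeps the object and changes the action (for example "stir the milk" against "pour the milk"), so its embedding sits close to the positive.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def info_nce_with_hard_negatives(video, positive, hard_negatives, temperature=0.07):
    """Cross-entropy over [positive, hard negatives] for a single video embedding.

    Hypothetical helper: the name, signature, and temperature are assumptions,
    not taken from the paper.
    """
    v = l2_normalize(video)
    candidates = [positive] + list(hard_negatives)
    logits = [dot(v, l2_normalize(t)) / temperature for t in candidates]

    # Numerically stable log-sum-exp over all candidate captions.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))

    # Negative log-probability assigned to the positive caption (index 0).
    return log_denominator - logits[0]


# Toy embeddings (assumed): the video matches the positive caption exactly.
video = [1.0, 0.0]
positive = [1.0, 0.0]
easy_negative = [0.0, 1.0]   # unrelated caption, far from the video
hard_negative = [0.9, 0.1]   # same object, different action: close to the video

loss_easy = info_nce_with_hard_negatives(video, positive, [easy_negative])
loss_hard = info_nce_with_hard_negatives(video, positive, [hard_negative])
```

In this toy setup the hard negative yields a much larger loss than the easy one, which is the property that pushes a model to separate captions differing only in the action.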

📝 Abstract

Understanding instructional videos requires recognizing fine-grained actions and modeling their temporal relations, which remains challenging for current Video Foundation Models (VFMs). This difficulty stems from noisy web supervision and a pervasive "static bias", where models rely on objects rather than motion cues. To address this, we propose InstrAction, a pretraining framework for learning action-centric representations of instructional videos. We first introduce a data-driven strategy that filters noisy captions and generates action-centric hard negatives to disentangle actions from objects during contrastive learning. At the visual feature level, an Action Perceiver extracts motion-relevant tokens from redundant video encodings. Beyond contrastive learning, we introduce two auxiliary objectives: Dynamic Time Warping alignment (DTW-Align) for modeling sequential temporal structure, and Masked Action Modeling (MAM) for strengthening cross-modal grounding. Finally, we introduce the InstrAct Bench to evaluate action-centric understanding, where our method consistently outperforms state-of-the-art VFMs on semantic reasoning, procedural logic, and fine-grained retrieval tasks.
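The abstract does not spell out how DTW-Align is formulated, so the following is only a sketch of classic dynamic time warping between a sequence of clip embeddings and a sequence of step-caption embeddings, which is the general technique the objective is named after. The function names, the cosine-distance cost, and the toy vectors are assumptions. The paper's training objective may use a differentiable variant or a different cost.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def dtw_align(clip_feats, step_feats):
    """Classic DTW: return (total cost, monotonic alignment path).

    Hypothetical helper for illustration; not the paper's DTW-Align loss.
    """
    n, m = len(clip_feats), len(step_feats)
    cost = [[cosine_distance(c, s) for s in step_feats] for c in clip_feats]

    # acc[i][j] = cheapest cost of aligning the first i clips with the first j steps.
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i][j] = cost[i - 1][j - 1] + min(
                acc[i - 1][j - 1],  # advance both sequences
                acc[i - 1][j],      # next clip stays on the same step
                acc[i][j - 1],      # next step stays on the same clip
            )

    # Backtrack from the end to recover which clip matches which step.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        moves = [
            (acc[i - 1][j - 1], i - 1, j - 1),
            (acc[i - 1][j], i - 1, j),
            (acc[i][j - 1], i, j - 1),
        ]
        _, i, j = min(moves, key=lambda t: t[0])
    path.reverse()
    return acc[n][m], path


# Toy example (assumed data): two steps, three clips.
steps = [[1.0, 0.0], [0.0, 1.0]]
ordered_clips = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]    # follows the step order
shuffled_clips = [[0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]   # violates the step order

ordered_cost, ordered_path = dtw_align(ordered_clips, steps)
shuffled_cost, _ = dtw_align(shuffled_clips, steps)
```

Because the alignment path must be monotonic, the shuffled clip sequence cannot be matched to the steps at zero cost, whereas the ordered one can. That sensitivity to ordering is what makes a DTW-based cost a natural signal for sequential temporal structure.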

Problem

Research questions and friction points this paper is trying to address.

instructional videos

action understanding

static bias

temporal relations

fine-grained actions

Innovation

Methods, ideas, or system contributions that make the work stand out.

action-centric representation

contrastive learning with hard negatives

Action Perceiver