Leveraging Procedural Knowledge and Task Hierarchies for Efficient Instructional Video Pre-training

📅 2025-02-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Instructional video understanding faces challenges in modeling fine-grained task-level and step-level semantics under data- and compute-constrained settings. Method: This paper proposes a video pre-training framework that jointly incorporates task-level hierarchical structure and procedural step-level prior knowledge. It is the first to simultaneously model task hierarchies and step-wise temporal logic, featuring a unified task/step prediction architecture, hierarchical knowledge distillation, temporal-aware video augmentation, and a validation-performance-driven early-stopping strategy for model selection. Contribution/Results: Under small-scale pre-training conditions, the method significantly outperforms baselines on three downstream tasks (task recognition, step recognition, and step prediction), and shows improved recommendation accuracy and cross-topic generalization.
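The unified task/step architecture mentioned above can be pictured as one shared video encoder feeding two classification heads. The sketch below is a hypothetical illustration of that idea, not the paper's actual implementation; all dimensions, weights, and names are assumptions.

```python
import numpy as np

# Hypothetical sketch of a unified task/step prediction head: a shared
# pooled video embedding is scored by two separate linear classifiers.
# D, NUM_TASKS, and NUM_STEPS are illustrative, not from the paper.
rng = np.random.default_rng(0)
D, NUM_TASKS, NUM_STEPS = 16, 5, 12

W_task = rng.normal(size=(D, NUM_TASKS))  # task-level head
W_step = rng.normal(size=(D, NUM_STEPS))  # step-level head

def predict(clip_features):
    """Pool per-clip features, then score tasks and steps from the
    shared embedding (the 'unified' part of the architecture)."""
    h = clip_features.mean(axis=0)        # temporal average pooling
    return int((h @ W_task).argmax()), int((h @ W_step).argmax())

clips = rng.normal(size=(8, D))           # 8 clips with D-dim features
task_id, step_id = predict(clips)
```

Because both heads read the same embedding, gradients from task and step supervision would shape one shared representation during pre-training.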

📝 Abstract
Instructional videos provide a convenient modality to learn new tasks (e.g., cooking a recipe or assembling furniture). A viewer will want to find a video that reflects both the overall task they are interested in and the relevant steps they need to carry out that task. To do this, an instructional video model should be capable of inferring both the tasks and the steps that occur in an input video. Doing this efficiently and in a generalizable fashion is key when compute or the relevant video topics used to train the model are limited. To address these requirements, we explicitly mine task hierarchies and the procedural steps associated with instructional videos. We use this prior knowledge to pre-train our model, $\texttt{Pivot}$, for step and task prediction. During pre-training, we also provide video augmentation and early-stopping strategies to optimally identify which model to use for downstream tasks. We test this pre-trained model on task recognition, step recognition, and step prediction on two downstream datasets. When pre-training data and compute are limited, we outperform previous baselines on these tasks. Leveraging prior task and step structures therefore enables efficient training of $\texttt{Pivot}$ for instructional video recommendation.
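The abstract mentions video augmentation during pre-training; one plausible reading of "temporal-aware" augmentation is a crop that subsamples clips while preserving their original step order. The sketch below illustrates that reading only; the function name, parameters, and procedure are assumptions, not the paper's method.

```python
import random

def temporal_crop(clips, keep_ratio=0.75, seed=0):
    """Keep a random, order-preserving subset of clips.
    Sorting the sampled indices preserves the temporal (step) order,
    which is the hypothesized 'temporal-aware' property."""
    rnd = random.Random(seed)
    n_keep = max(1, int(len(clips) * keep_ratio))
    idx = sorted(rnd.sample(range(len(clips)), n_keep))
    return [clips[i] for i in idx]

# Example: keep half of 8 clips; the survivors stay in order.
aug = temporal_crop(list(range(8)), keep_ratio=0.5, seed=1)
```

An order-destroying crop would corrupt the step-wise temporal logic the model is meant to learn, which is why the indices are sorted before gathering.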
Problem

Research questions and friction points this paper is trying to address.

Efficient instructional video pre-training
Task and step recognition
Limited compute resources optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mines task hierarchies and procedural steps
Uses prior knowledge for model pre-training
Applies video augmentation and early stopping
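The early-stopping strategy in the last bullet selects a pre-trained checkpoint by validation performance. A minimal sketch of that selection logic, assuming a standard patience-based rule (the `patience` parameter and helper name are illustrative, not from the paper):

```python
def select_checkpoint(val_scores, patience=3):
    """Return the index of the best checkpoint, stopping once the
    validation metric has not improved for `patience` epochs."""
    best_idx, best, stale = 0, float("-inf"), 0
    for i, score in enumerate(val_scores):
        if score > best:
            best, best_idx, stale = score, i, 0
        else:
            stale += 1
            if stale >= patience:
                break  # stop training; later scores are never seen
    return best_idx

# The 0.55 at the end is never reached: three stale epochs trigger the stop.
best = select_checkpoint([0.41, 0.48, 0.52, 0.51, 0.50, 0.49, 0.55],
                         patience=3)  # → 2 (the 0.52 checkpoint)
```

This is what makes the selection "validation-performance-driven": the checkpoint is chosen by downstream-proxy metrics rather than by pre-training loss.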