Demo-ICL: In-Context Learning for Procedural Video Knowledge Acquisition

📅 2026-02-09
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses a limitation of existing video understanding benchmarks: they primarily assess static knowledge and do not evaluate a model's ability to acquire procedural knowledge from a few dynamic demonstrations. To this end, we introduce Demo-driven Video In-Context Learning, the first in-context learning task tailored to procedural videos, along with Demo-ICL-Bench, a novel benchmark supporting multimodal (text and video) demonstrations. We further propose Demo-ICL, an MLLM trained with a two-stage strategy that combines video-supervised fine-tuning with information-assisted direct preference optimization to strengthen its capacity to generalize procedural knowledge from demonstrations. Experimental results show that current models perform poorly on this task, whereas our approach significantly improves in-context learning performance, opening a new direction for video understanding research.

πŸ“ Abstract
Despite the growing video understanding capabilities of recent Multimodal Large Language Models (MLLMs), existing video benchmarks primarily assess understanding based on models' static, internal knowledge, rather than their ability to learn and adapt to dynamic, novel contexts from a few examples. To bridge this gap, we present Demo-driven Video In-Context Learning, a novel task focused on learning from in-context demonstrations to answer questions about target videos. Alongside this, we propose Demo-ICL-Bench, a challenging benchmark designed to evaluate demo-driven video in-context learning capabilities. Demo-ICL-Bench is constructed from 1200 instructional YouTube videos with associated questions, from which two types of demonstrations are derived: (i) summarized video subtitles as text demonstrations; and (ii) corresponding instructional videos as video demonstrations. To effectively tackle this new challenge, we develop Demo-ICL, an MLLM with a two-stage training strategy: video-supervised fine-tuning and information-assisted direct preference optimization, jointly enhancing the model's ability to learn from in-context examples. Extensive experiments with state-of-the-art MLLMs confirm the difficulty of Demo-ICL-Bench, demonstrate the effectiveness of Demo-ICL, and thereby unveil future research directions.
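To make the task setup concrete, the sketch below shows how one benchmark instance (a demonstration plus a target-video question) might be assembled into an in-context prompt. This is a minimal illustration only: the field names, video placeholder syntax, and example content are assumptions, not Demo-ICL-Bench's actual schema.

```python
from dataclasses import dataclass
from typing import Literal

# Hypothetical schema for one demo-driven ICL instance; the real
# benchmark format may differ.
@dataclass
class DemoICLExample:
    demo_type: Literal["text", "video"]  # (i) subtitle summary or (ii) demo video
    demonstration: str                   # summary text, or a demo-video path/ID
    target_video: str                    # path/ID of the video being questioned
    question: str
    answer: str                          # reference answer for evaluation

def build_prompt(ex: DemoICLExample) -> str:
    """Assemble an in-context prompt: demonstration first, then the target query."""
    if ex.demo_type == "text":
        demo_block = f"Demonstration (subtitle summary):\n{ex.demonstration}"
    else:
        demo_block = f"Demonstration video: <video:{ex.demonstration}>"
    return (
        f"{demo_block}\n\n"
        f"Target video: <video:{ex.target_video}>\n"
        f"Question: {ex.question}\nAnswer:"
    )

ex = DemoICLExample(
    demo_type="text",
    demonstration="Step 1: whisk eggs. Step 2: heat the pan. Step 3: pour and fold.",
    target_video="omelette_attempt_042.mp4",
    question="What should be done before pouring the eggs?",
    answer="Heat the pan.",
)
print(build_prompt(ex))
```

A model is then scored on whether its completion of the prompt matches the reference answer; the point of the task is that answering requires the procedural steps supplied in context, not knowledge memorized during pretraining.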
Problem

Research questions and friction points this paper is trying to address.

in-context learning
video understanding
multimodal large language models
procedural knowledge
few-shot learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

In-Context Learning
Multimodal Large Language Models
Video Understanding
Demonstration-based Learning
Preference Optimization