🤖 AI Summary
Existing video diffusion models suffer from poor temporal consistency and weak controllability with respect to textual instructions when generating multi-step procedural instructional videos. To address these limitations, we propose a framework for instructional video generation built on a diffusion-autoregressive hybrid architecture that enables streaming, interactive long-video synthesis. Our method introduces three key innovations: (1) a predictive causal adaptation mechanism that explicitly models temporal dependencies among actions; (2) a future-guided self-forcing strategy leveraging dual-region KV caching for dynamic future prompt injection and drift mitigation; and (3) multi-granularity prompt conditioning that jointly grounds generation on both initial frames and structured textual instructions. Experiments demonstrate substantial improvements in temporal coherence and semantic fidelity on complex procedural tasks, consistently outperforming state-of-the-art video diffusion models.
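The predictive causal adaptation described above amounts to an autoregressive loop in which each new frame is conditioned both on the generated history and on an anticipated future keyframe. The paper does not give this interface; the following is a minimal sketch with hypothetical callables (`predict_next_frame`, `anticipate_keyframe`) and toy scalar "frames" standing in for the real models:

```python
def generate_stream(init_frame, steps, predict_next_frame, anticipate_keyframe):
    """Toy autoregressive loop: each new frame conditions on the history and
    on an anticipated future keyframe. Both callables are illustrative
    stand-ins for the causal video model and the keyframe predictor."""
    frames = [init_frame]
    for _ in range(steps):
        keyframe = anticipate_keyframe(frames)           # guess where the action is headed
        frames.append(predict_next_frame(frames, keyframe))
    return frames

# Toy dynamics: frames are numbers; the "keyframe" extrapolates one unit ahead,
# and each step moves halfway toward it.
anticipate = lambda fs: fs[-1] + 1.0
step = lambda fs, kf: 0.5 * (fs[-1] + kf)
out = generate_stream(0.0, 3, step, anticipate)  # → [0.0, 0.5, 1.0, 1.5]
```

The point of the future term is visible even in this toy: each frame is pulled toward an anticipated target rather than extrapolated blindly from the past, which is what mitigates drift over long rollouts.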
📝 Abstract
Instructional video generation is an emerging task that aims to synthesize coherent demonstrations of procedural activities from textual descriptions. Such capability has broad implications for content creation, education, and human-AI interaction, yet existing video diffusion models struggle to maintain temporal consistency and controllability across long sequences of multiple action steps. We introduce SneakPeek, a future-driven, diffusion-based autoregressive framework for streaming instructional video generation, designed to generate precise, stepwise instructional videos conditioned on an initial image and structured textual prompts. Our approach introduces three key innovations to enhance consistency and controllability: (1) predictive causal adaptation, where a causal model learns to perform next-frame prediction and anticipate future keyframes; (2) future-guided self-forcing with a dual-region KV caching scheme to address exposure bias at inference time; and (3) multi-prompt conditioning, which provides fine-grained, procedural control over multi-step instructions. Together, these components mitigate temporal drift, preserve motion consistency, and enable interactive video generation in which future prompt updates dynamically influence ongoing streaming generation. Experimental results demonstrate that our method produces temporally coherent and semantically faithful instructional videos that accurately follow complex, multi-step task descriptions.
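The dual-region KV caching scheme can be pictured as two separately managed attention-cache regions: an append-only region for already-generated frames, and a prompt region that is rebuilt whenever the user updates the future instruction mid-stream, leaving the frame context untouched. The class below is a minimal data-structure sketch, not the paper's implementation; all names are hypothetical:

```python
from collections import deque

class DualRegionKVCache:
    """Toy sketch of a dual-region KV cache (hypothetical interface).

    frame_region: (key, value) entries for generated frames, bounded and
    append-only, so old context is evicted rather than recomputed.
    prompt_region: (key, value) entries for the current future prompt,
    replaced wholesale when the instruction changes."""

    def __init__(self, max_frames):
        self.frame_region = deque(maxlen=max_frames)
        self.prompt_region = []

    def append_frame(self, key, value):
        self.frame_region.append((key, value))

    def update_future_prompt(self, prompt_kv):
        # Only the prompt region is invalidated; cached frame context
        # persists, so a new instruction steers generation immediately
        # without reprocessing the video history.
        self.prompt_region = list(prompt_kv)

    def context(self):
        # Attention at the next step reads both regions.
        return list(self.frame_region) + self.prompt_region

cache = DualRegionKVCache(max_frames=3)
for t in range(4):
    cache.append_frame(f"k{t}", f"v{t}")        # oldest frame evicted at t=3
cache.update_future_prompt([("pk0", "pv0")])     # mid-stream prompt injection
```

Keeping the two regions separate is what makes the interactive setting cheap: a prompt update touches only the (small) prompt region, while the frame cache behaves like an ordinary sliding-window KV cache.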