Taming generative video models for zero-shot optical flow extraction

📅 2025-07-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a limitation of existing optical flow extraction methods, which either require fine-tuning or rely on photometric constraints. We propose a zero-shot optical flow extraction framework that directly decodes latent motion information from a frozen self-supervised video generative model, without any adaptation. To this end, we introduce the counterfactual world model paradigm into generative video modeling and design KL-tracing: a novel algorithm that leverages the model's inherent distributional prediction capability, factorized latent space structure, and random-access decoding property. By applying localized perturbations and performing single-step rollouts, KL-tracing quantifies inter-frame changes in the latent predictive distributions via KL divergence to track optical flow. On TAP-Vid DAVIS and Kubric, our method reduces endpoint error by 16.6% and 4.7%, respectively, outperforming both unsupervised and photometric-loss baselines. This demonstrates that frozen generative models inherently encode interpretable, geometrically meaningful motion representations.

📝 Abstract
Extracting optical flow from videos remains a core computer vision problem. Motivated by the success of large general-purpose models, we ask whether frozen self-supervised video models trained only for future frame prediction can be prompted, without fine-tuning, to output flow. Prior work reading out depth or illumination from video generators required fine-tuning, which is impractical for flow, where labels are scarce and synthetic datasets suffer from a sim-to-real gap. Inspired by the Counterfactual World Model (CWM) paradigm, which can obtain point-wise correspondences by injecting a small tracer perturbation into a next-frame predictor and tracking its propagation, we extend this idea to generative video models. We explore several popular architectures and find that successful zero-shot flow extraction in this manner is aided by three model properties: (1) distributional prediction of future frames (avoiding blurry or noisy outputs); (2) factorized latents that treat each spatio-temporal patch independently; and (3) random-access decoding that can condition on any subset of future pixels. These properties are uniquely present in the recent Local Random Access Sequence (LRAS) architecture. Building on LRAS, we propose KL-tracing: a novel test-time procedure that injects a localized perturbation into the first frame, rolls out the model one step, and computes the Kullback-Leibler divergence between the perturbed and unperturbed predictive distributions. Without any flow-specific fine-tuning, our method outperforms state-of-the-art models on the real-world TAP-Vid DAVIS dataset (16.6% relative improvement in endpoint error) and the synthetic TAP-Vid Kubric dataset (4.7% relative improvement). Our results indicate that counterfactual prompting of controllable generative video models is a scalable and effective alternative to supervised or photometric-loss approaches for high-quality flow.
Problem

Research questions and friction points this paper is trying to address.

Extracting optical flow from videos without fine-tuning
Using generative models for zero-shot flow extraction
Improving flow accuracy on real-world and synthetic datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-shot flow extraction without fine-tuning
Counterfactual prompting with KL-tracing
Leveraging LRAS architecture properties
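The KL-tracing procedure above (perturb, roll out one step, locate the peak KL divergence) can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the `predict` callable, its `(H, W, V)` per-patch distribution output, and the perturbation amplitude are all assumptions standing in for a frozen LRAS-style model.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """Pointwise KL(p || q) over the last (vocabulary) axis."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * np.log(p / q), axis=-1)

def kl_tracing(predict, frame0, frame1, src_yx, perturb_amp=0.2):
    """Zero-shot flow at one query point via counterfactual tracing.

    `predict(frame0, frame1)` is a hypothetical stand-in for a frozen
    generative video model: it is assumed to return an (H, W, V) array
    of per-patch predictive distributions for the next frame.
    """
    # Single-step rollout on the clean input pair.
    clean = predict(frame0, frame1)

    # Inject a small, localized tracer perturbation at the query point.
    y, x = src_yx
    perturbed_frame0 = frame0.copy()
    perturbed_frame0[y, x] += perturb_amp
    perturbed = predict(perturbed_frame0, frame1)

    # KL map: where did the predictive distribution shift the most?
    kl_map = kl_divergence(perturbed, clean)
    tgt_y, tgt_x = np.unravel_index(np.argmax(kl_map), kl_map.shape)

    # Flow is the displacement from the source to the peak-KL location.
    return (tgt_y - y, tgt_x - x)
```

With a toy `predict` whose latents simply shift content by a fixed offset, the returned displacement recovers that offset, which is the intuition behind reading flow out of the KL map.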
👥 Authors

Seungwoo Kim · Stanford University
Khai Loong Aw · Stanford University
Klemen Kotar · PhD Candidate, Stanford University · Artificial Intelligence
Cristobal Eyzaguirre · Ph.D. Student, Stanford University
Wanhee Lee · Stanford University
Yunong Liu · Stanford University
Jared Watrous · Stanford University
Stefan Stojanov · Postdoc at Stanford Vision Lab and Neuro AI Lab · Computer Vision, Machine Learning
Juan Carlos Niebles · Research Director (Salesforce) & Adjunct Professor (Stanford University) · Action Recognition, Video Understanding, Video Analysis, Computer Vision
Jiajun Wu · Stanford University
Daniel L. K. Yamins · Stanford University