🤖 AI Summary
This work addresses a limitation of existing optical flow estimation methods, which either require fine-tuning or rely on photometric constraints. We propose a zero-shot flow extraction framework that decodes latent motion information directly from a frozen self-supervised video generative model, without any adaptation. To this end, we bring the counterfactual world model paradigm to generative video modeling and design KL-tracing, a test-time algorithm that exploits three model properties: distributional prediction of future frames, a factorized latent space that treats each spatio-temporal patch independently, and random-access decoding. By injecting a localized perturbation and performing a single-step rollout, KL-tracing quantifies the change in the model's predictive distributions via KL divergence and reads out optical flow from where that change lands. On TAP-Vid DAVIS and TAP-Vid Kubric, our method reduces endpoint error by 16.6% and 4.7% (relative), respectively, outperforming state-of-the-art supervised and photometric-loss baselines. These results show that frozen generative models encode interpretable, geometrically meaningful motion representations.
📝 Abstract
Extracting optical flow from videos remains a core computer vision problem. Motivated by the success of large general-purpose models, we ask whether frozen self-supervised video models trained only for future frame prediction can be prompted, without fine-tuning, to output flow. Prior work reading out depth or illumination from video generators required fine-tuning, which is impractical for flow, where labels are scarce and synthetic datasets suffer from a sim-to-real gap. Inspired by the Counterfactual World Model (CWM) paradigm, which can obtain point-wise correspondences by injecting a small tracer perturbation into a next-frame predictor and tracking its propagation, we extend this idea to generative video models. We explore several popular architectures and find that successful zero-shot flow extraction in this manner is aided by three model properties: (1) distributional prediction of future frames (avoiding blurry or noisy outputs); (2) factorized latents that treat each spatio-temporal patch independently; and (3) random-access decoding that can condition on any subset of future pixels. These properties are uniquely present in the recent Local Random Access Sequence (LRAS) architecture. Building on LRAS, we propose KL-tracing: a novel test-time procedure that injects a localized perturbation into the first frame, rolls out the model one step, and computes the Kullback-Leibler divergence between perturbed and unperturbed predictive distributions. Without any flow-specific fine-tuning, our method outperforms state-of-the-art models on the real-world TAP-Vid DAVIS dataset (16.6% relative improvement in endpoint error) and the synthetic TAP-Vid Kubric benchmark (4.7% relative improvement). Our results indicate that counterfactual prompting of controllable generative video models is a scalable and effective alternative to supervised or photometric-loss approaches for high-quality flow.
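The perturb-rollout-KL loop behind KL-tracing can be sketched on a toy model. The `toy_predictor` below is a hypothetical stand-in for the frozen generative model, not the LRAS implementation: it maps each patch token independently (mimicking factorized latents) to a sharply peaked categorical distribution over next-frame tokens, here by simply translating the token grid by a fixed shift. Perturbing one source patch, taking a single prediction step with and without the perturbation, and locating the argmax of the per-patch KL map then recovers that shift as the flow vector.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def toy_predictor(frame_tokens, shift=(2, 3), vocab=16, sharp=5.0):
    """Hypothetical stand-in for a frozen video model: predicts per-patch
    categorical next-frame distributions by translating the token grid."""
    H, W = frame_tokens.shape
    nxt = np.roll(np.roll(frame_tokens, shift[0], axis=0), shift[1], axis=1)
    logits = np.full((H, W, vocab), -sharp)  # low probability everywhere...
    logits[np.arange(H)[:, None], np.arange(W)[None, :], nxt] = sharp  # ...peaked at the shifted token
    return softmax(logits)  # shape (H, W, vocab)

def kl_div(p, q, eps=1e-9):
    """Per-patch KL divergence KL(p || q) over the vocabulary axis."""
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def kl_trace(frame_tokens, src, predictor, vocab=16):
    """KL-tracing sketch: perturb the source patch, roll out one step with
    and without the perturbation, and read flow off the KL map's peak."""
    clean = predictor(frame_tokens)
    perturbed_tokens = frame_tokens.copy()
    perturbed_tokens[src] = (perturbed_tokens[src] + 1) % vocab  # localized tracer
    perturbed = predictor(perturbed_tokens)
    kl_map = kl_div(perturbed, clean)              # change is concentrated at the destination
    dst = np.unravel_index(np.argmax(kl_map), kl_map.shape)
    return (dst[0] - src[0], dst[1] - src[1])      # flow = destination - source

rng = np.random.default_rng(0)
frame0 = rng.integers(0, 16, size=(8, 8))
flow = kl_trace(frame0, src=(1, 1), predictor=toy_predictor)
# recovers the toy model's ground-truth shift of (2, 3)
```

Because the toy predictor moves each token independently, the perturbation changes exactly one destination patch's distribution, so the KL map is zero everywhere else; a real model's KL map would be soft, which is why the paper's procedure relies on the model's distributional outputs rather than a single decoded frame.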