🤖 AI Summary
While image diffusion models have been extensively studied for visual representation learning, the representational capabilities of video diffusion models, particularly their utility for visual understanding tasks, remain underexplored.
Method: We systematically investigate video diffusion models within a unified architectural framework, training the same architecture under video versus image diffusion objectives and probing the resulting representations on four downstream tasks: image classification, action recognition, depth estimation, and object tracking (a minimal sketch of the probing protocol follows the results below).
Contribution/Results: We find that temporal modeling significantly enhances general-purpose visual representations, with gains that vary by task and across network layers. Quantitative analysis shows that layer choice, noise level, model size, and training budget all measurably affect representation quality. Video diffusion models consistently outperform their image-based counterparts on every evaluated task, demonstrating the importance of temporal priors for understanding-oriented representation learning. This work addresses a key gap by providing direct empirical evidence for extending diffusion models from generative to discriminative use.
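To make the probing protocol concrete, the sketch below shows one common way such diffusion features are extracted and evaluated: noise an input clip to a chosen timestep, run a single pass through the frozen denoiser, capture an intermediate activation with a forward hook, and train only a linear head on the pooled features. Everything here (the `DummyUNet` stand-in, the cosine schedule, the layer name) is an illustrative assumption, not the paper's actual implementation.

```python
import math
import torch
import torch.nn as nn

class DummyUNet(nn.Module):
    """Stand-in for a (video) diffusion denoiser; real backbones are far larger."""
    def __init__(self, channels=3, width=64):
        super().__init__()
        self.enc = nn.Conv3d(channels, width, 3, padding=1)   # an "early" layer
        self.mid = nn.Conv3d(width, width, 3, padding=1)      # a "middle" layer
        self.dec = nn.Conv3d(width, channels, 3, padding=1)

    def forward(self, x, t):  # real denoisers also condition on the timestep t
        return self.dec(torch.relu(self.mid(torch.relu(self.enc(x)))))

@torch.no_grad()
def extract_features(model, clip, timestep, layer="mid", num_steps=1000):
    """Noise the clip to `timestep`, denoise once, and capture `layer`'s output."""
    feats = {}
    hook = dict(model.named_modules())[layer].register_forward_hook(
        lambda m, i, o: feats.update(out=o))
    # DDPM-style forward noising q(x_t | x_0) with a cosine alpha-bar schedule
    # (one common choice; the paper's schedule may differ).
    abar = math.cos(timestep / num_steps * math.pi / 2) ** 2
    x_t = math.sqrt(abar) * clip + math.sqrt(1 - abar) * torch.randn_like(clip)
    model(x_t, timestep)
    hook.remove()
    return feats["out"].flatten(2).mean(-1)   # pool space and time -> (B, C)

model = DummyUNet().eval()
clips = torch.randn(4, 3, 8, 32, 32)          # (batch, C, frames, H, W)
z = extract_features(model, clips, timestep=100)
probe = nn.Linear(z.shape[1], 10)             # only this linear head is trained
logits = probe(z)
```

Sweeping `timestep` and `layer` in a loop like this is exactly the kind of per-layer, per-noise-level analysis the results above describe.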
📝 Abstract
Diffusion models have revolutionized generative modeling, enabling unprecedented realism in image and video synthesis. This success has sparked interest in leveraging their representations for visual understanding tasks. While recent works have explored this potential for image diffusion models, the visual understanding capabilities of video diffusion models remain largely uncharted. To address this gap, we systematically compare the same model architecture trained for video versus image generation, analyzing the performance of their latent representations on downstream tasks including image classification, action recognition, depth estimation, and tracking. Results show that video diffusion models consistently outperform their image counterparts, though the margin of this advantage varies strikingly across tasks. We further analyze features extracted from different layers and at varying noise levels, as well as the effect of model size and training budget on representation and generation quality. This work marks the first direct comparison of video and image diffusion objectives for visual understanding, offering insights into the role of temporal information in representation learning.
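For context on the noise levels referenced above: in the standard DDPM formulation, which we assume here (the paper may use a variant such as v-prediction), features are extracted from inputs noised by the forward process, and both objectives minimize the same denoising loss; only the data changes, a single image for the image objective versus a clip with an extra temporal dimension for the video objective:

$$
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\,I\right),
\qquad
\mathcal{L} = \mathbb{E}_{x_0,\,\varepsilon,\,t}\,\bigl\lVert \varepsilon - \varepsilon_\theta(x_t, t) \bigr\rVert^2,
$$

where $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$ decreases with the timestep $t$, so larger $t$ means a noisier input.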