🤖 AI Summary
This work addresses a gap in how self-supervised video models are evaluated as world models: evaluation usually rests on a single top-1 accuracy score on clean benchmarks and lacks systematic analysis of robustness. The authors present what they describe as the first systematic study of this gap, comparing four matched-capacity video foundation models (V-JEPA 2.1, V-JEPA 2, VideoPrism, and VideoMAEv2) along five robustness axes: feature discriminability, corruption robustness, fine-grained discrimination, occlusion robustness, and sensitivity to temporal direction. Across all five axes, latent-prediction models (the V-JEPA family) form a distinct and consistent profile: they degrade more gracefully under pixel corruption, preserve usable class structure under occlusion, capture fine-grained physical contact cues without reconstructing pixels, and are the only models that encode the arrow of time. Notably, a frozen V-JEPA 2 backbone paired with a lightweight attentive probe outperforms a fully fine-tuned VideoMAE and a supervised TimeSformer on corruption and occlusion robustness.
📝 Abstract
Self-supervised video models are increasingly framed as world models, yet their evaluation remains largely confined to a single top-1 accuracy score on clean benchmarks. This leaves a major gap in comprehending their potential as world models. We present the first systematic study addressing this gap, analyzing four matched-capacity frontier video foundation models, V-JEPA 2.1, V-JEPA 2, VideoPrism, and VideoMAEv2, across five robustness axes relevant to their deployment as video world models: feature discriminability, corruption robustness, fine-grained discrimination, occlusion robustness, and sensitivity to temporal direction. Our evaluations establish that across all five axes, latent-prediction models form a distinct and consistent profile. They degrade more gracefully under pixel corruption, preserve usable class structure rather than mere geometric stability under occlusion, capture fine-grained physical contact cues without reconstructing pixels, and uniquely encode the arrow of time. These advantages can even survive task adaptation: a frozen V-JEPA 2 backbone with a lightweight attentive probe outperforms a fully fine-tuned VideoMAE and a supervised TimeSformer on corruption and occlusion robustness. Our extensive results offer concrete new evidence in favor of latent prediction for robust world modeling.
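To make two of the axes above concrete, here is a minimal sketch of how sensitivity to temporal direction and robustness under perturbation could be scored. It is not the paper's protocol: the metric definitions (cosine distance between forward and reversed clip features, mean accuracy retained relative to clean accuracy), the toy encoders, and every function name are assumptions chosen for illustration, and a clip is reduced to a short list of per-frame feature vectors.

```python
import math
from typing import Callable, List, Sequence

Clip = Sequence[Sequence[float]]  # toy stand-in for a video: frames x values


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def temporal_direction_sensitivity(
    encode: Callable[[Clip], List[float]], clip: Clip
) -> float:
    """Assumed metric: cosine distance between features of a clip and its reversal.

    A value near 0 means the encoder cannot tell forward from backward playback.
    """
    forward = encode(clip)
    backward = encode(list(reversed(clip)))
    return 1.0 - cosine_similarity(forward, backward)


def order_blind_encoder(clip: Clip) -> List[float]:
    """Hypothetical encoder: mean-pools frames, so frame order is lost."""
    n = len(clip)
    return [sum(frame[i] for frame in clip) / n for i in range(len(clip[0]))]


def order_aware_encoder(clip: Clip) -> List[float]:
    """Hypothetical encoder: mean frame plus mean frame-to-frame difference."""
    n = len(clip)
    dims = range(len(clip[0]))
    mean_diff = [
        sum(clip[t + 1][i] - clip[t][i] for t in range(n - 1)) / (n - 1)
        for i in dims
    ]
    return order_blind_encoder(clip) + mean_diff


def relative_retention(clean_acc: float, perturbed_accs: Sequence[float]) -> float:
    """Assumed metric: mean fraction of clean accuracy kept across perturbation levels."""
    return sum(acc / clean_acc for acc in perturbed_accs) / len(perturbed_accs)


if __name__ == "__main__":
    # Toy clip in which the first value grows over time.
    clip = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
    print(temporal_direction_sensitivity(order_blind_encoder, clip))
    print(temporal_direction_sensitivity(order_aware_encoder, clip))
    # Placeholder accuracies, not results from the paper.
    print(relative_retention(0.8, [0.6, 0.4]))
```

In this toy setup the mean-pooling encoder scores close to zero because reversal leaves its features unchanged, while the difference-based encoder scores well above zero. A real evaluation would replace the toy encoders with frozen backbone features and the placeholder accuracies with measured ones.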