🤖 AI Summary
Video-language models (Video-LLMs) face significant bottlenecks in modeling spatial relationships, temporal ordering, and cross-frame continuity. To address these limitations, the authors propose VideoPASTA, an annotation-free adversarial preference-alignment framework that requires no video re-encoding. The approach introduces what the authors describe as the first self-supervised paradigm for generating adversarial samples tailored to video's intrinsic spatiotemporal structure. Using only 32-frame sampling and 7,020 self-generated adversarial preference pairs, it applies Direct Preference Optimization (DPO) to fine-grained spatial and temporal relational modeling, yielding efficient, architecture-agnostic, plug-and-play alignment. Evaluated on VideoMME, NeXTQA, and LongVideoBench, the method achieves relative gains of 3.05%, 1.97%, and 1.31% over Qwen2.5-VL, respectively, outperforming multi-GPU long-sequence training baselines.
📝 Abstract
Video-language models (Video-LLMs) excel at understanding video content but struggle with spatial relationships, temporal ordering, and cross-frame continuity. To address these limitations, we introduce VideoPASTA (Preference Alignment with Spatio-Temporal-Cross Frame Adversaries), a framework that enhances Video-LLMs through targeted preference optimization. VideoPASTA trains models to distinguish accurate video representations from carefully generated adversarial examples that deliberately violate spatial, temporal, or cross-frame relations. By applying Direct Preference Optimization to just 7,020 preference pairs, VideoPASTA learns robust representations that capture fine-grained spatial relationships and long-range temporal dynamics. Experiments on standard video benchmarks show significant relative performance gains of 3.05% on VideoMME, 1.97% on NeXTQA, and 1.31% on LongVideoBench over the baseline Qwen2.5-VL model. These results demonstrate that targeted alignment, rather than massive pretraining or architectural modifications, effectively addresses core video-language challenges. Notably, VideoPASTA achieves these improvements without human annotation or captioning, relying on just 32-frame sampling, compared to the 96-frame, multi-GPU setups of prior work. This efficiency makes our approach a scalable, plug-and-play solution that seamlessly integrates with existing models while preserving their capabilities.
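To make the alignment objective concrete, the sketch below shows the standard DPO loss for a single preference pair, where the "chosen" response is a faithful description of the video and the "rejected" response is an adversarial one that violates, say, temporal ordering. This is a minimal, self-contained illustration of DPO itself, not VideoPASTA's implementation; the log-probability values and the `beta` setting are hypothetical.

```python
import math

def dpo_loss(pi_logp_chosen, pi_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Each argument is the summed token log-probability of a full response
    under the trainable policy (pi) or the frozen reference model (ref).
    """
    # Implicit reward margin: how much more the policy prefers the
    # faithful answer over the adversarial one, relative to the reference.
    margin = beta * ((pi_logp_chosen - ref_logp_chosen)
                     - (pi_logp_rejected - ref_logp_rejected))
    # Negative log-sigmoid of the margin: small when the policy already
    # ranks the faithful answer well above the adversarial one.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical pair: a correct temporal-ordering answer (chosen) vs. an
# adversarial answer that reverses the event order (rejected).
loss_aligned = dpo_loss(-12.0, -20.0, -14.0, -14.0)     # policy prefers chosen
loss_misaligned = dpo_loss(-20.0, -12.0, -14.0, -14.0)  # policy prefers rejected
print(loss_aligned < loss_misaligned)  # True
```

Minimizing this loss over the 7,020 pairs pushes the policy to assign higher likelihood to the faithful response than to its spatial, temporal, or cross-frame adversary, without any human-written labels.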