VideoPASTA: 7K Preference Pairs That Matter for Video-LLM Alignment

📅 2025-04-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Video-language models (Video-LLMs) face significant bottlenecks in modeling spatial relationships, temporal ordering, and cross-frame continuity. To address these limitations, the paper proposes an annotation-free adversarial preference alignment framework that requires no video re-encoding, introducing a self-supervised paradigm for generating adversarial samples tailored to video's intrinsic spatiotemporal structure. Using only 32-frame sampling and 7,020 self-generated adversarial preference pairs, it applies Direct Preference Optimization (DPO) with fine-grained spatial-temporal relational modeling, yielding an efficient, architecture-agnostic, plug-and-play alignment method. Evaluated on VideoMME, NeXTQA, and LongVideoBench, it achieves relative improvements of 3.05%, 1.97%, and 1.31% over Qwen2.5-VL, respectively, outperforming multi-GPU long-sequence training baselines.

📝 Abstract
Video-language models (Video-LLMs) excel at understanding video content but struggle with spatial relationships, temporal ordering, and cross-frame continuity. To address these limitations, we introduce VideoPASTA (Preference Alignment with Spatio-Temporal-Cross Frame Adversaries), a framework that enhances Video-LLMs through targeted preference optimization. VideoPASTA trains models to distinguish accurate video representations from carefully generated adversarial examples that deliberately violate spatial, temporal, or cross-frame relations. By applying Direct Preference Optimization to just 7,020 preference pairs, VideoPASTA learns robust representations that capture fine-grained spatial relationships and long-range temporal dynamics. Experiments on standard video benchmarks show significant relative performance gains of 3.05% on VideoMME, 1.97% on NeXTQA, and 1.31% on LongVideoBench, over the baseline Qwen2.5-VL model. These results demonstrate that targeted alignment, rather than massive pretraining or architectural modifications, effectively addresses core video-language challenges. Notably, VideoPASTA achieves these improvements without human annotation or captioning, relying on just 32-frame sampling, compared to the 96-frame, multi-GPU setups of prior work. This efficiency makes our approach a scalable, plug-and-play solution that seamlessly integrates with existing models while preserving their capabilities.
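The abstract describes applying Direct Preference Optimization to pairs of faithful versus adversarial responses. As a rough sketch of that objective — this is the standard DPO loss (Rafailov et al., 2023), not code from the paper; the function name and scalar interface are illustrative:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Inputs are log-probabilities of the preferred ("chosen") response and
    the adversarial ("rejected") response under the policy being trained
    (pi_*) and a frozen reference model (ref_*). beta scales the implicit
    KL penalty that keeps the policy close to the reference.
    """
    # Margin: how much more the policy prefers the chosen response
    # than the reference model does, relative to the rejected one.
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)), written stably as log(1 + e^-x).
    return math.log1p(math.exp(-beta * margin))
```

When the policy and reference agree, the margin is zero and the loss is log 2; driving the margin up (preferring the faithful response more strongly than the reference does) drives the loss toward zero. In VideoPASTA, the rejected responses are the generated examples that violate spatial, temporal, or cross-frame relations.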
Problem

Research questions and friction points this paper is trying to address.

Enhancing Video-LLMs' spatial-temporal understanding
Optimizing models with adversarial preference alignment
Improving video-language alignment efficiently without human annotation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adversarial training with spatial, temporal, and cross-frame negative examples
Direct Preference Optimization on 7K pairs
Efficient 32-frame sampling without human annotation