VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models

📅 2025-04-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limited fine-grained temporal alignment and compositional reasoning of video-text models on continuous multi-event videos. Building on datasets with temporally localized event captions (ActivityNet-Captions and YouCook2), the authors construct two compositional benchmarks for this setting, ActivityNet-Comp and YouCook2-Comp, with hard negatives derived from subtle temporal disruptions. They propose a hierarchical pairwise preference loss that penalizes captions in proportion to the severity of their disruption, and a pretraining strategy that concatenates short video-caption pairs to offset the scarcity of densely annotated multi-event data; training combines contrastive alignment with preference ranking and concatenation-based pretraining. Experiments show that the approach improves the compositional sensitivity of vision-language models (VLMs) and large multimodal models (LMMs) under fine-grained perturbations such as temporal reordering and action-word substitution, while the benchmarks expose remaining weaknesses in dynamic video-text alignment.
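
The hierarchical pairwise preference loss is only named here, not spelled out, but a minimal PyTorch sketch of one plausible reading is below: negatives are ordered by disruption severity, and the margin the positive pair must beat grows with that severity. The function name, the linear margin schedule, and the (B, K) score layout are assumptions for illustration, not the paper's published formulation.

```python
import torch
import torch.nn.functional as F

def hierarchical_preference_loss(pos_score, neg_scores, base_margin=0.1):
    """Hypothetical sketch of a hierarchical pairwise preference loss.

    pos_score  : (B,) similarity of each video with its correct caption.
    neg_scores : (B, K) similarities with K negatives ordered from mildly
                 to severely disrupted.
    """
    loss = 0.0
    for k in range(neg_scores.shape[1]):
        # More severely disrupted captions (larger k) must be pushed
        # further below the positive, so the required margin grows with k.
        margin = base_margin * (k + 1)
        loss = loss + F.relu(margin - (pos_score - neg_scores[:, k])).mean()
    return loss / neg_scores.shape[1]

# Toy usage: one positive score and three increasingly disrupted negatives.
pos = torch.tensor([0.9, 0.8])
negs = torch.tensor([[0.7, 0.5, 0.2], [0.6, 0.4, 0.1]])
print(hierarchical_preference_loss(pos, negs))
```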

📝 Abstract
We introduce VideoComp, a benchmark and learning framework for advancing video-text compositionality understanding, aimed at improving vision-language models (VLMs) in fine-grained temporal alignment. Unlike existing benchmarks focused on static image-text compositionality or isolated single-event videos, our benchmark targets alignment in continuous multi-event videos. Leveraging video-text datasets with temporally localized event captions (e.g. ActivityNet-Captions, YouCook2), we construct two compositional benchmarks, ActivityNet-Comp and YouCook2-Comp. We create challenging negative samples with subtle temporal disruptions such as reordering, action word replacement, partial captioning, and combined disruptions. These benchmarks comprehensively test models' compositional sensitivity across extended, cohesive video-text sequences. To improve model performance, we propose a hierarchical pairwise preference loss that strengthens alignment with temporally accurate pairs and gradually penalizes increasingly disrupted ones, encouraging fine-grained compositional learning. To mitigate the limited availability of densely annotated video data, we introduce a pretraining strategy that concatenates short video-caption pairs to simulate multi-event sequences. We evaluate video-text foundational models and large multimodal models (LMMs) on our benchmark, identifying both strengths and areas for improvement in compositionality. Overall, our work provides a comprehensive framework for evaluating and enhancing model capabilities in achieving fine-grained, temporally coherent video-text alignment.
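
As a concrete illustration of the four disruption types the abstract names (reordering, action-word replacement, partial captioning, and combined disruptions), here is a toy Python sketch that turns one video's temporally ordered event captions into hard negatives. The verb-swap table and sampling choices are invented for illustration; the paper's actual generation pipeline may differ.

```python
import random

# Illustrative verb substitutions; the paper's actual swap vocabulary is unknown.
ACTION_SWAPS = {"add": "remove", "open": "close", "pour": "drain"}

def make_negatives(event_captions):
    """Build one negative per disruption type from temporally ordered captions."""
    negs = {}

    # Temporal reordering: shuffle events so the described order is wrong.
    reordered = event_captions[:]
    random.shuffle(reordered)
    negs["reorder"] = reordered

    # Action-word replacement: substitute verbs with mismatched ones.
    negs["action_swap"] = [
        " ".join(ACTION_SWAPS.get(word, word) for word in caption.split())
        for caption in event_captions
    ]

    # Partial captioning: drop the final event so coverage is incomplete.
    negs["partial"] = event_captions[:-1]

    # Combined disruption: reorder the action-swapped captions.
    combined = negs["action_swap"][:]
    random.shuffle(combined)
    negs["combined"] = combined
    return negs

captions = ["open the fridge", "pour milk into a bowl", "add cereal"]
print(make_negatives(captions))
```
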
Problem

Research questions and friction points this paper is trying to address.

Advancing video-text compositional and temporal alignment in VLMs
Targeting alignment in continuous multi-event video sequences
Improving model sensitivity to fine-grained temporal disruptions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical pairwise preference loss enhances alignment
Pretraining strategy simulates multi-event sequences (see the sketch after this list)
Benchmarks test compositional sensitivity in videos
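
The concatenation pretraining idea can be approximated as below: stitch k randomly sampled short video-caption pairs into one pseudo multi-event sample. The frame-major (T, C, H, W) tensor layout and the " then " caption joiner are assumptions for illustration, not the paper's exact recipe.

```python
import random
import torch

def concat_pretraining_sample(clips, k=3):
    """Stitch k short video-caption pairs into one pseudo multi-event sample.

    clips : list of (frames, caption) pairs, frames shaped (T, C, H, W).
    """
    chosen = random.sample(clips, k)
    # Concatenate along the time axis so the clips play back to back.
    video = torch.cat([frames for frames, _ in chosen], dim=0)
    # Join captions in playback order to form a multi-event description.
    caption = " then ".join(caption for _, caption in chosen)
    return video, caption

# Toy usage with random frames standing in for real clips.
clips = [(torch.randn(8, 3, 224, 224), f"event {i}") for i in range(5)]
video, caption = concat_pretraining_sample(clips)
print(video.shape, "|", caption)
```
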
🔎 Similar Papers
No similar papers found.