🤖 AI Summary
Current vision-language models suffer from coarse annotations, skewed data distributions, and insufficient compositional generalization in temporal alignment tasks. To address these limitations, we introduce SVLTA, the first controllable synthetic benchmark for vision-language temporal alignment, designed to evaluate fine-grained synchronization between dynamic visual events and natural language descriptions. Our method introduces a video-scenario generation paradigm grounded in commonsense knowledge, manipulable action primitives, and constraint-based filtering, enabling statistical distribution decoupling and high-fidelity synthesis. We conduct rigorous evaluation via temporal question answering, distribution-shift testing, and alignment diagnostics, uncovering systematic temporal localization biases across mainstream models. Empirical results demonstrate that SVLTA achieves high diversity, semantic plausibility, and strong diagnostic efficacy. It establishes a reproducible, attributable, and fine-grained evaluation framework for temporal alignment capability, advancing both benchmark design and model diagnosis in vision-language understanding.
📝 Abstract
Vision-language temporal alignment is a crucial capability for recognizing and reasoning about human dynamics in real-world scenarios. While existing research focuses on capturing vision-language relevance, it is limited by biased temporal distributions, imprecise annotations, and insufficient compositionality. To enable fair evaluation and comprehensive exploration, our objective is to investigate and evaluate the ability of models to achieve alignment from a temporal perspective, specifically their capacity to synchronize visual scenarios with linguistic context in a temporally coherent manner. As a preliminary step, we present a statistical analysis of existing benchmarks and reveal their challenges from a decomposed perspective. To this end, we introduce SVLTA, a Synthetic Vision-Language Temporal Alignment benchmark derived via a well-designed and feasible controlled generation method within a simulation environment. The approach incorporates commonsense knowledge, manipulable actions, and constrained filtering, producing reasonable, diverse, and balanced data distributions for diagnostic evaluations. Our experiments reveal diagnostic insights through evaluations of temporal question answering, sensitivity to distributional shift, and temporal alignment adaptation.
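The constrained-filtering idea above, i.e. filtering generated annotations so that temporal start positions are not skewed toward one part of the video, can be sketched as a simple rejection-sampling step. The schema and function below are hypothetical illustrations, not the paper's actual pipeline:

```python
import random

def balance_temporal_distribution(samples, num_bins=10, per_bin=100, seed=0):
    """Cap each histogram bin of normalized start times so the resulting
    subset has a roughly uniform temporal distribution.

    `samples` is a list of dicts with 'start' and 'duration' keys
    (a hypothetical annotation schema, not the actual SVLTA format).
    """
    rng = random.Random(seed)
    bins = [[] for _ in range(num_bins)]
    for s in rng.sample(samples, len(samples)):  # visit in random order
        t = s["start"] / s["duration"]           # normalized start in [0, 1)
        b = min(int(t * num_bins), num_bins - 1)
        if len(bins[b]) < per_bin:               # constraint: cap each bin
            bins[b].append(s)
    return [s for b in bins for s in b]
```

A biased source pool (e.g. most moments starting near the beginning of the video) would pass through this filter and come out with each occupied start-time bin limited to `per_bin` examples, decoupling the evaluation set from the generator's temporal prior.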