🤖 AI Summary
Current vision-language models suffer from coarse annotations, skewed data distributions, and insufficient compositional generalization in temporal alignment tasks. To address these limitations, we introduce SVLTA, the first controllable synthetic benchmark for vision-language temporal alignment, designed to evaluate fine-grained synchronization between dynamic visual events and natural language descriptions. Our method introduces a video-scenario generation paradigm grounded in commonsense knowledge, manipulable action primitives, and constraint-based filtering, enabling statistical distribution decoupling and high-fidelity synthesis. We conduct rigorous evaluation via temporal question answering, distribution-shift testing, and alignment diagnostics, uncovering systematic temporal localization biases across mainstream models. Empirical results demonstrate that SVLTA achieves high diversity, semantic plausibility, and strong diagnostic efficacy. It establishes a reproducible, attributable, and fine-grained evaluation framework for temporal alignment capability, advancing both benchmark design and model diagnosis in vision-language understanding.
📝 Abstract
Vision-language temporal alignment is a crucial capability for recognizing and reasoning about human dynamics in real-world scenarios. While existing research focuses on capturing vision-language relevance, it is limited by biased temporal distributions, imprecise annotations, and insufficient compositionality. To enable fair evaluation and comprehensive exploration, our objective is to investigate and evaluate the ability of models to achieve alignment from a temporal perspective, specifically their capacity to synchronize visual scenarios with linguistic context in a temporally coherent manner. As a preliminary step, we present a statistical analysis of existing benchmarks and reveal their challenges from a decomposed perspective. To this end, we introduce SVLTA, a Synthetic Vision-Language Temporal Alignment benchmark derived via a well-designed and feasible controlled generation method within a simulation environment. The approach incorporates commonsense knowledge, manipulable actions, and constrained filtering, producing reasonable, diverse, and balanced data distributions for diagnostic evaluations. Our experiments reveal diagnostic insights through evaluations of temporal question answering, sensitivity to distributional shift, and temporal alignment adaptation.
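The constrained-filtering idea above, i.e. filtering generated annotations so that temporal start positions are not skewed toward one part of the video, can be sketched as a simple rejection-sampling step. The schema and function below are hypothetical illustrations, not the paper's actual pipeline:

```python
import random

def balance_temporal_distribution(samples, num_bins=10, per_bin=100, seed=0):
    """Cap each histogram bin of normalized start times so the resulting
    subset has a roughly uniform temporal distribution.

    `samples` is a list of dicts with 'start' and 'duration' keys
    (a hypothetical annotation schema, not the actual SVLTA format).
    """
    rng = random.Random(seed)
    bins = [[] for _ in range(num_bins)]
    for s in rng.sample(samples, len(samples)):  # visit in random order
        t = s["start"] / s["duration"]           # normalized start in [0, 1)
        b = min(int(t * num_bins), num_bins - 1)
        if len(bins[b]) < per_bin:               # constraint: cap each bin
            bins[b].append(s)
    return [s for b in bins for s in b]
```

A biased source pool (e.g. most moments starting near the beginning of the video) would pass through this filter and come out with each occupied start-time bin limited to `per_bin` examples, decoupling the evaluation set from the generator's temporal prior.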