Temporally-Grounded Language Generation: A Benchmark for Real-Time Vision-Language Models

📅 2025-05-16
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of generating language that is both semantically accurate and precisely timed when vision-language models (VLMs) operate in real-time interactive settings. To this end, it introduces TGLG, the first benchmark for temporally-grounded language generation over streaming video. Two core capabilities are formally defined: *perceptual updating* (timely incorporation of new visual evidence) and *contingency awareness* (generating utterances contingent on evolving scene dynamics). The paper proposes TRACE, a cross-modal metric that jointly evaluates semantic similarity and temporal alignment, and presents VLM-TSI, a time-synchronized interleaved architecture that abandons conventional turn-based modeling assumptions. Evaluated on sports-commentary and egocentric human-interaction datasets, VLM-TSI significantly outperforms a strong baseline, yet overall performance remains modest, highlighting the difficulty of TGLG. The work establishes a standardized evaluation framework, releases an open-source benchmark, and provides a methodological foundation for real-time VLM research.

πŸ“ Abstract
Vision-language models (VLMs) have shown remarkable progress in offline tasks such as image captioning and video question answering. However, real-time interactive environments impose new demands on VLMs, requiring them to generate utterances that are not only semantically accurate but also precisely timed. We identify two core capabilities necessary for such settings -- *perceptual updating* and *contingency awareness* -- and propose a new benchmark task, **Temporally-Grounded Language Generation (TGLG)**, to evaluate them. TGLG requires models to generate utterances in response to streaming video such that both content and timing align with dynamic visual input. To support this benchmark, we curate evaluation datasets from sports broadcasting and egocentric human interaction domains, and introduce a new metric, **TRACE**, to evaluate TGLG by jointly measuring semantic similarity and temporal alignment. Finally, we present **Vision-Language Model with Time-Synchronized Interleaving (VLM-TSI)**, a model that interleaves visual and linguistic tokens in a time-synchronized manner, enabling real-time language generation without relying on turn-based assumptions. Experimental results show that VLM-TSI significantly outperforms a strong baseline, yet overall performance remains modest -- highlighting the difficulty of TGLG and motivating further research in real-time VLMs. Code and data available [here](https://github.com/yukw777/tglg).
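The abstract does not spell out TRACE's formula, only that it jointly measures semantic similarity and temporal alignment. As a rough illustration of that idea (not the paper's actual metric), a TRACE-style score might multiply a semantic term by a temporal-offset penalty; the function names, the Jaccard placeholder for semantics, and the Gaussian penalty below are all assumptions for the sketch.

```python
# Hypothetical sketch of a TRACE-style score. The real metric is defined in the
# paper; here a placeholder semantic similarity is combined with a Gaussian
# temporal-alignment term purely to illustrate the "joint" evaluation idea.
import math


def temporal_alignment(t_pred: float, t_ref: float, sigma: float = 1.0) -> float:
    """Score in (0, 1]: 1.0 when the predicted timestamp matches the reference,
    decaying smoothly as the utterance drifts earlier or later."""
    return math.exp(-((t_pred - t_ref) ** 2) / (2 * sigma ** 2))


def semantic_similarity(pred: str, ref: str) -> float:
    """Placeholder semantics: token-overlap (Jaccard) similarity. A real system
    would likely use embedding-based similarity instead."""
    a, b = set(pred.lower().split()), set(ref.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 0.0


def trace_like_score(pred: str, t_pred: float, ref: str, t_ref: float) -> float:
    """Joint score: an utterance must be both the right content and on time;
    either term going to zero drives the whole score to zero."""
    return semantic_similarity(pred, ref) * temporal_alignment(t_pred, t_ref)
```

The multiplicative combination is one simple design choice: a perfectly worded but badly mistimed utterance still scores near zero, which matches the benchmark's emphasis on timing.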
Problem

Research questions and friction points this paper is trying to address.

Evaluating real-time vision-language models for dynamic visual input
Assessing perceptual updating and contingency awareness in VLMs
Developing temporally-grounded language generation benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Time-synchronized interleaving of visual and linguistic tokens
TGLG benchmark for real-time vision-language evaluation
TRACE metric for semantic and temporal alignment
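The core idea behind the interleaving innovation is that, instead of turn-based alternation, visual and linguistic tokens are merged into a single stream ordered by wall-clock time. A minimal sketch of that ordering, assuming illustrative `Token` records with timestamps (the data structures and names are not the paper's implementation):

```python
# Hypothetical sketch of time-synchronized interleaving: two time-stamped
# streams (video frames and utterance words) are merged into one sequence
# ordered by timestamp, so the model sees language exactly where it falls
# relative to the visual evidence.
import heapq
from dataclasses import dataclass


@dataclass
class Token:
    t: float        # timestamp in seconds
    modality: str   # "vision" or "text"
    payload: str    # frame id or word


def interleave(frames: list[Token], words: list[Token]) -> list[Token]:
    """Merge two already time-sorted token streams by timestamp."""
    return list(heapq.merge(frames, words, key=lambda tok: tok.t))


frames = [Token(0.0, "vision", "f0"), Token(0.5, "vision", "f1"), Token(1.0, "vision", "f2")]
words = [Token(0.6, "text", "nice"), Token(0.7, "text", "shot")]
stream = interleave(frames, words)
```

In this toy example the words land between the second and third frames, mirroring how a commentary utterance would be conditioned on the frames that precede it rather than on a completed "turn".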
Authors

Keunwoo Peter Yu (Wayve)
Joyce Chai (Computer Science and Engineering Division, University of Michigan, Ann Arbor, MI, USA)