SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding

📅 2025-02-15
🤖 AI Summary
Existing large vision-language models (LVLMs) lack systematic evaluation of long-term temporal reasoning for streaming video understanding: mainstream benchmarks focus on single-frame or single-instance question answering and neglect continuous spatiotemporal inference. Method: We introduce SVBench, the first temporal multi-turn dialogue benchmark for streaming video, comprising 1,353 long videos and 49,979 timestamped QA pairs, built with novel cross-segment temporal linkage modeling and temporal QA-chain generation. We further develop StreamingChat, the first open-source LVLM to achieve significant progress in streaming video understanding, trained via a semi-automatic annotation pipeline and joint fine-tuning of Qwen-VL and Video-LLaMA. Contribution/Results: Our evaluation across 14 models reveals critical bottlenecks in long-context temporal reasoning. StreamingChat achieves state-of-the-art performance on SVBench among open-source LVLMs while maintaining competitive results on general vision-language benchmarks.

📝 Abstract
Despite the significant advancements of Large Vision-Language Models (LVLMs) on established benchmarks, there remains a notable gap in evaluating their applicability to the emerging domain of long-context streaming video understanding. Current benchmarks for video understanding typically emphasize isolated single-instance text inputs and fail to evaluate the capacity to sustain temporal reasoning throughout the entire duration of video streams. To address these limitations, we introduce SVBench, a pioneering benchmark with temporal multi-turn question-answering chains specifically designed to thoroughly assess the streaming video understanding capabilities of current LVLMs. We design a semi-automated annotation pipeline to obtain 49,979 Question-Answer (QA) pairs for 1,353 streaming videos, which includes generating QA chains that represent a series of consecutive multi-turn dialogues over video segments and constructing temporal linkages between successive QA chains. Our experimental results, obtained from 14 models in dialogue and streaming evaluations, reveal that while the closed-source GPT-4o outperforms others, most open-source LVLMs struggle with long-context streaming video understanding. We also construct a StreamingChat model, which significantly outperforms open-source LVLMs on our SVBench and achieves comparable performance on diverse vision-language benchmarks. We expect SVBench to advance the research of streaming video understanding by providing a comprehensive and in-depth analysis of current LVLMs. Our benchmark and model can be accessed at https://yzy-bupt.github.io/SVBench.
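The abstract describes the benchmark's core data structure: timestamped QA chains (multi-turn dialogues over video segments) connected by temporal linkages between successive chains. A minimal sketch of what such a record might look like is below; all class and field names are illustrative assumptions, not SVBench's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class QAPair:
    """One turn of a dialogue, timestamped within the video stream."""
    question: str
    answer: str
    timestamp: float  # seconds into the video stream

@dataclass
class QAChain:
    """A multi-turn dialogue grounded in one video segment."""
    segment_start: float
    segment_end: float
    turns: list = field(default_factory=list)  # list of QAPair

@dataclass
class StreamingVideoEntry:
    """One benchmark video with its ordered QA chains."""
    video_id: str
    chains: list = field(default_factory=list)  # list of QAChain

    def temporal_links(self):
        """Pairs of consecutive chains, modeling the cross-segment
        temporal linkages the paper describes."""
        return list(zip(self.chains, self.chains[1:]))
```

Under this sketch, a streaming evaluation would feed a model the video up to each chain's segment and score its answers turn by turn, with the temporal links testing whether reasoning carries over between segments.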
Problem

Research questions and friction points this paper is trying to address.

Evaluating LVLMs on long-context streaming video understanding
Addressing gaps in temporal reasoning of video streams
Introducing SVBench for multi-turn dialogue assessments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces SVBench for streaming video understanding
Semi-automated annotation for QA pairs
StreamingChat model outperforms open-source LVLMs
Zhenyu Yang
Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Kuaishou Technology
Yuhang Hu
Zhengzhou University
Zemin Du
ShanghaiTech University
Dizhan Xue
Institute of Automation, Chinese Academy of Sciences
Multimedia, Cross-Modal Reasoning, Explainable AI
Shengsheng Qian
PhD, Institute of Automation, Chinese Academy of Sciences (CASIA)
Data Mining, Multimedia, Social Media
Jiahong Wu
Kuaishou Technology
Fan Yang
Kuaishou Technology
Weiming Dong
Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Changsheng Xu
Professor, Institute of Automation, Chinese Academy of Sciences
Multimedia, Computer Vision