Online Video Understanding: A Comprehensive Benchmark and Memory-Augmented Method

📅 2024-12-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of real-time understanding of continuous video streams by multimodal large language models (e.g., in autonomous driving and human-machine interaction), this work introduces OVBench, the first online video understanding benchmark to support question answering across three temporal contexts: past, present, and future. We propose the Pyramid Memory Bank (PMB), which enables long-term, efficient storage and retrieval of critical spatiotemporal information. We further design an interleaved dialogue-style training paradigm for offline-to-online adaptation, combined with temporal-aware feature fusion and a dedicated online video instruction-tuning dataset. The resulting model, VideoChat-Online, achieves state-of-the-art performance on both OVBench and mainstream offline benchmarks. It maintains high accuracy while significantly improving inference efficiency and reducing computational overhead, empirically validating the synergy between temporal modeling and memory mechanisms.

📝 Abstract
Multimodal Large Language Models (MLLMs) have shown significant progress in offline video understanding. However, applying these models to real-world scenarios, such as autonomous driving and human-computer interaction, presents unique challenges due to the need for real-time processing of continuous online video streams. To this end, this paper presents systematic efforts from three perspectives: evaluation benchmark, model architecture, and training strategy. First, we introduce OVBench, a comprehensive question-answering benchmark specifically designed to evaluate models' ability to perceive, memorize, and reason within online video contexts. It features six core task types across three temporal contexts (past, present, and future), forming 16 subtasks from diverse datasets. Second, we propose a new Pyramid Memory Bank (PMB) that effectively retains key spatiotemporal information in video streams. Third, we propose an offline-to-online learning paradigm, designing an interleaved dialogue format for online video data and constructing an instruction-tuning dataset tailored for online video training. This framework led to the development of VideoChat-Online, a robust and efficient model for online video understanding. Despite its lower computational cost and higher efficiency, VideoChat-Online outperforms existing state-of-the-art offline and online models on popular offline video benchmarks and on OVBench, demonstrating the effectiveness of our model architecture and training strategy.
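The Pyramid Memory Bank is described here only at a high level. As one way to picture a pyramid-style memory for streaming video, the sketch below keeps the most recent frames at full temporal resolution and promotes progressively subsampled older frames to coarser levels, bounding total storage while retaining long-range context. All class and parameter names are hypothetical; this is a generic illustration under assumed semantics, not the paper's actual PMB implementation.

```python
from collections import deque

class PyramidMemoryBank:
    """Illustrative pyramid-style memory for a frame stream (hypothetical
    sketch, not the paper's PMB). Level 0 holds recent frames at full
    temporal resolution; each overflow promotes one frame of every pair
    to the next (coarser) level, i.e. 2x temporal subsampling per level."""

    def __init__(self, levels=3, capacity_per_level=4):
        self.levels = [deque() for _ in range(levels)]
        self.capacity = capacity_per_level

    def add(self, frame):
        """Ingest one frame (or frame feature) from the stream."""
        self.levels[0].append(frame)
        # Cascade overflows upward: evict the two oldest frames of a
        # full level and keep only one of them at the coarser level.
        for i in range(len(self.levels) - 1):
            if len(self.levels[i]) > self.capacity:
                oldest = self.levels[i].popleft()
                self.levels[i].popleft()          # drop its pair (2x stride)
                self.levels[i + 1].append(oldest)
        # Coarsest level: discard the oldest frame beyond capacity.
        if len(self.levels[-1]) > self.capacity:
            self.levels[-1].popleft()

    def snapshot(self):
        """Stored frames ordered oldest (coarsest) to newest (finest)."""
        out = []
        for level in reversed(self.levels):
            out.extend(level)
        return out

# Ingest 20 frames: old history is kept sparsely, recent frames densely.
pmb = PyramidMemoryBank(levels=3, capacity_per_level=4)
for f in range(20):
    pmb.add(f)
print(pmb.snapshot())  # [0, 4, 8, 10, 12, 14, 16, 17, 18, 19]
```

The bank stores at most `levels * capacity_per_level` frames regardless of stream length, which is the kind of bounded-memory property an online model needs; the real PMB presumably uses learned feature compression rather than plain frame dropping.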
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Real-time Video Processing
Autonomous Driving and Human-Computer Interaction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Online Video Processing
Pyramid Memory Bank
Resource-efficient Computing
👥 Authors

Zhenpeng Huang
Nanjing University, OpenGVLab, Shanghai AI Laboratory

Xinhao Li
Nanjing University
Video Understanding · Multimodal LLM · Vision-Language Learning

Jiaqi Li
China Mobile Research Institute

Jing Wang
Nanjing University, OpenGVLab, Shanghai AI Laboratory

Xiangyu Zeng
Nanjing University, OpenGVLab, Shanghai AI Laboratory

Cheng Liang
Shanghai AI Lab
VLM

Tao Wu
Nanjing University, OpenGVLab, Shanghai AI Laboratory

Xi Chen
China Mobile Research Institute

Liang Li
China Mobile Research Institute

Limin Wang
Nanjing University, OpenGVLab, Shanghai AI Laboratory