Online Video Understanding: A Comprehensive Benchmark and Memory-Augmented Method

📅 2024-12-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of real-time understanding of continuous video streams by multimodal large language models (e.g., in autonomous driving and human-machine interaction), this work introduces OVBench, the first online video understanding benchmark to support question answering across three temporal contexts: past, present, and future. We propose the Pyramid Memory Bank (PMB), which enables long-term, efficient storage and retrieval of critical spatiotemporal information. We further design an interleaved dialogue-style training paradigm for offline-to-online adaptation, combined with temporal-aware feature fusion and a dedicated online video instruction-tuning dataset. The resulting model, VideoChat-Online, achieves state-of-the-art performance on both OVBench and mainstream offline benchmarks. It maintains high accuracy while significantly improving inference efficiency and reducing computational overhead, empirically validating the synergy between temporal modeling and memory mechanisms.

📝 Abstract
Multimodal Large Language Models (MLLMs) have shown significant progress in offline video understanding. However, applying these models to real-world scenarios, such as autonomous driving and human-computer interaction, presents unique challenges due to the need for real-time processing of continuous online video streams. To this end, this paper presents systematic efforts from three perspectives: evaluation benchmark, model architecture, and training strategy. First, we introduce OVBench, a comprehensive question-answering benchmark specifically designed to evaluate models' ability to perceive, memorize, and reason within online video contexts. It features six core task types across three temporal contexts (past, present, and future), forming 16 subtasks from diverse datasets. Second, we propose a new Pyramid Memory Bank (PMB) that effectively retains key spatiotemporal information in video streams. Third, we propose an offline-to-online learning paradigm, designing an interleaved dialogue format for online video data and constructing an instruction-tuning dataset tailored for online video training. This framework led to the development of VideoChat-Online, a robust and efficient model for online video understanding. Despite its lower computational cost and higher efficiency, VideoChat-Online outperforms existing state-of-the-art offline and online models on popular offline video benchmarks and on OVBench, demonstrating the effectiveness of our model architecture and training strategy.
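The Pyramid Memory Bank is described here only at a high level. As one way to picture a pyramid-style memory for streaming video, the sketch below keeps the most recent frames at full temporal resolution and promotes progressively subsampled older frames to coarser levels, bounding total storage while retaining long-range context. All class and parameter names are hypothetical; this is a generic illustration under assumed semantics, not the paper's actual PMB implementation.

```python
from collections import deque

class PyramidMemoryBank:
    """Illustrative pyramid-style memory for a frame stream (hypothetical
    sketch, not the paper's PMB). Level 0 holds recent frames at full
    temporal resolution; each overflow promotes one frame of every pair
    to the next (coarser) level, i.e. 2x temporal subsampling per level."""

    def __init__(self, levels=3, capacity_per_level=4):
        self.levels = [deque() for _ in range(levels)]
        self.capacity = capacity_per_level

    def add(self, frame):
        """Ingest one frame (or frame feature) from the stream."""
        self.levels[0].append(frame)
        # Cascade overflows upward: evict the two oldest frames of a
        # full level and keep only one of them at the coarser level.
        for i in range(len(self.levels) - 1):
            if len(self.levels[i]) > self.capacity:
                oldest = self.levels[i].popleft()
                self.levels[i].popleft()          # drop its pair (2x stride)
                self.levels[i + 1].append(oldest)
        # Coarsest level: discard the oldest frame beyond capacity.
        if len(self.levels[-1]) > self.capacity:
            self.levels[-1].popleft()

    def snapshot(self):
        """Stored frames ordered oldest (coarsest) to newest (finest)."""
        out = []
        for level in reversed(self.levels):
            out.extend(level)
        return out

# Ingest 20 frames: old history is kept sparsely, recent frames densely.
pmb = PyramidMemoryBank(levels=3, capacity_per_level=4)
for f in range(20):
    pmb.add(f)
print(pmb.snapshot())  # [0, 4, 8, 10, 12, 14, 16, 17, 18, 19]
```

The bank stores at most `levels * capacity_per_level` frames regardless of stream length, which is the kind of bounded-memory property an online model needs; the real PMB presumably uses learned feature compression rather than plain frame dropping.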
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Real-time Video Processing
Autonomous Driving and Human-Computer Interaction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Online Video Processing
Pyramid Memory Bank
Resource-efficient Computing
👥 Authors

Zhenpeng Huang
Nanjing University, OpenGVLab, Shanghai AI Laboratory

Xinhao Li
Nanjing University
Video Understanding · Multimodal LLM · Vision-Language Learning

Jiaqi Li
China Mobile Research Institute

Jing Wang
Nanjing University, OpenGVLab, Shanghai AI Laboratory

Xiangyu Zeng
Nanjing University, OpenGVLab, Shanghai AI Laboratory

Cheng Liang
Shanghai AI Lab
VLM

Tao Wu
Nanjing University, OpenGVLab, Shanghai AI Laboratory

Xi Chen
China Mobile Research Institute

Liang Li
China Mobile Research Institute

Limin Wang
Nanjing University, OpenGVLab, Shanghai AI Laboratory