🤖 AI Summary
To address severe feature degradation, high computational overhead, and information loss from aggressive compression in long-video understanding, this paper proposes Hierarchical visual token Compression (HiCo), enabling high-fidelity temporal modeling from the clip level to the video level. The authors design a short-to-long multi-stage training paradigm, introduce LongVid, a large-scale dataset of real-world long videos, and upgrade the "Needle-In-A-video-Haystack" (NIAH) task for rigorous evaluation of context capacity. Built upon a 7B-parameter multimodal large language model, the resulting system, VideoChat-Flash, supports end-to-end understanding of videos comprising up to 10,000 frames. On NIAH it reaches 99.1% accuracy over 10,000 frames, the first open-source model at this scale to do so, and it attains state-of-the-art performance on both mainstream long- and short-video benchmarks. All code, models, and datasets are publicly released.
📝 Abstract
Long-context modeling is a critical capability for multimodal large language models (MLLMs), enabling them to process long-form content with implicit memorization. Despite recent advances, handling extremely long videos remains challenging due to the difficulty of maintaining crucial features over extended sequences. This paper introduces a Hierarchical visual token Compression (HiCo) method designed for high-fidelity representation, and a practical context modeling system, VideoChat-Flash, tailored for multimodal long-sequence processing. HiCo capitalizes on the redundancy of visual information in long videos to compress the long video context from the clip level to the video level, significantly reducing computation while preserving essential details. VideoChat-Flash features a multi-stage short-to-long learning scheme, a rich dataset of real-world long videos named LongVid, and an upgraded "Needle-In-A-video-Haystack" (NIAH) task for evaluating context capacities. In extensive experiments, VideoChat-Flash shows leading performance on both mainstream long and short video benchmarks at the 7B model scale. It is the first open-source model to achieve 99.1% accuracy over 10,000 frames in NIAH.
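The clip-to-video compression described above can be pictured as a two-stage pooling pipeline: first reduce tokens within each clip (exploiting local redundancy), then reduce again across the whole video. The sketch below is a minimal illustration assuming simple average pooling at each level; the function names (`hierarchical_compress`, `avg_pool`) and the ratios `clip_ratio`/`video_ratio` are illustrative assumptions, not the paper's actual HiCo operators.

```python
# Minimal sketch of hierarchical two-stage visual-token compression.
# Assumption: a plain average-pooling compressor at each level; the real
# HiCo method and its compression ratios are not specified here.
from typing import List

Vec = List[float]  # a visual token as a feature vector

def avg_pool(tokens: List[Vec], ratio: int) -> List[Vec]:
    """Merge every `ratio` consecutive tokens into their element-wise mean."""
    pooled = []
    for i in range(0, len(tokens), ratio):
        group = tokens[i:i + ratio]
        dim = len(group[0])
        pooled.append([sum(t[d] for t in group) / len(group) for d in range(dim)])
    return pooled

def hierarchical_compress(clips: List[List[Vec]],
                          clip_ratio: int = 4,
                          video_ratio: int = 2) -> List[Vec]:
    # Stage 1: clip-level compression, applied independently per clip.
    clip_tokens = [tok for clip in clips for tok in avg_pool(clip, clip_ratio)]
    # Stage 2: video-level compression across the concatenated sequence.
    return avg_pool(clip_tokens, video_ratio)

# Example: 100 clips x 64 tokens = 6400 tokens -> 6400 / 4 / 2 = 800 tokens.
clips = [[[float(i)] for i in range(64)] for _ in range(100)]
print(len(hierarchical_compress(clips)))  # 800
```

The point of the two stages is that most redundancy in long videos is local (adjacent frames within a clip), so aggressive reduction can happen early and cheaply, while the second pass trims what remains at the sequence level before the tokens reach the language model.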