VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling

📅 2024-12-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address severe feature degradation, high computational overhead, and information loss from aggressive compression in long-video understanding, this paper proposes Hierarchical visual token Compression (HiCo), which exploits visual redundancy to compress long video context from the clip level to the video level while preserving essential details. The authors design a short-to-long multi-stage training scheme, introduce LongVid, a rich dataset of real-world long videos, and upgrade the "Needle-In-A-video-Haystack" (NIAH) benchmark for evaluating long-context capacity. Built on a 7B-parameter multimodal large language model, the resulting system, VideoChat-Flash, supports end-to-end understanding of videos of up to 10,000 frames. It is the first open-source model to reach 99.1% accuracy over 10,000 frames in NIAH, and it attains leading performance on mainstream long- and short-video benchmarks at the 7B scale. Code, models, and datasets are publicly released.

📝 Abstract
Long-context modeling is a critical capability for multimodal large language models (MLLMs), enabling them to process long-form content with implicit memorization. Despite these advances, handling extremely long videos remains challenging due to the difficulty of maintaining crucial features over extended sequences. This paper introduces a Hierarchical visual token Compression (HiCo) method designed for high-fidelity representation, and a practical context modeling system, VideoChat-Flash, tailored for multimodal long-sequence processing. HiCo capitalizes on the redundancy of visual information in long videos to compress long video context from the clip level to the video level, significantly reducing computation while preserving essential details. VideoChat-Flash features a multi-stage short-to-long learning scheme, a rich dataset of real-world long videos named LongVid, and an upgraded "Needle-In-A-video-Haystack" (NIAH) benchmark for evaluating context capacities. In extensive experiments, VideoChat-Flash shows leading performance on both mainstream long- and short-video benchmarks at the 7B model scale. It is the first open-source model to reach 99.1% accuracy over 10,000 frames in NIAH.
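The abstract's clip-to-video compression pipeline can be illustrated with a minimal sketch: pool adjacent tokens within each clip, then drop near-duplicate tokens across clips. This is a hypothetical toy illustration of the general idea, not the paper's actual HiCo implementation; all shapes, ratios, and thresholds here are invented for demonstration.

```python
# Toy two-stage visual token compression: clip-level pooling, then
# video-level deduplication. Hypothetical sketch only -- NOT HiCo itself.
import math

def pool_clip(clip_tokens, ratio):
    """Stage 1: average-pool groups of `ratio` adjacent tokens within one clip."""
    n_keep = len(clip_tokens) // ratio
    pooled = []
    for i in range(n_keep):
        group = clip_tokens[i * ratio:(i + 1) * ratio]
        dim = len(group[0])
        pooled.append([sum(v[j] for v in group) / ratio for j in range(dim)])
    return pooled

def dedup_video(clips, sim_thresh):
    """Stage 2: across clips, keep a token only if it is not nearly identical
    (by cosine similarity) to the last kept token."""
    tokens = [t for clip in clips for t in clip]
    kept = [tokens[0]]
    for t in tokens[1:]:
        prev = kept[-1]
        dot = sum(a * b for a, b in zip(t, prev))
        norm = math.sqrt(sum(a * a for a in t)) * math.sqrt(sum(a * a for a in prev))
        if dot / (norm + 1e-8) < sim_thresh:  # keep only sufficiently novel tokens
            kept.append(t)
    return kept

# Usage: 4 clips x 16 synthetic 4-d tokens -> 4x clip pooling -> video dedup.
clips = [[[float(c * 16 + i + j) for j in range(4)] for i in range(16)]
         for c in range(4)]
stage1 = [pool_clip(c, ratio=4) for c in clips]   # 4 pooled tokens per clip
video_tokens = dedup_video(stage1, sim_thresh=0.999)
print(len(stage1[0]), len(video_tokens))
```

Real systems would compress learned vision-encoder features (and typically use learned mergers rather than fixed pooling), but the two-stage structure — aggressive local compression followed by global redundancy removal — is what allows frame counts in the thousands to fit a fixed LLM context budget.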
Problem

Research questions and friction points this paper is trying to address.

Long Video Processing
Multimodal Large Language Models
Information Compression
Innovation

Methods, ideas, or system contributions that make the work stand out.

HiCo
VideoChat-Flash
Long Video Compression and Retrieval
👥 Authors
Xinhao Li
Nanjing University
Video Understanding · Multimodal LLM · Vision-Language Learning
Yi Wang
Shanghai AI Laboratory
Jiashuo Yu
Shanghai AI Laboratory
Audio-Visual Learning · Computer Vision · Multimodal Learning
Xiangyu Zeng
Nanjing University, Shanghai AI Laboratory
Yuhan Zhu
Nanjing University, Shanghai AI Laboratory
Computer Vision · Vision-Language Models · Video Understanding
Haian Huang
Shanghai AI Laboratory
Jianfei Gao
Shanghai AI Laboratory
Kunchang Li
ByteDance Seed
Video Understanding · Multimodal Learning
Yinan He
Shanghai AI Laboratory
Chenting Wang
Shanghai Jiao Tong University
Computer Vision · Video Understanding
Yu Qiao
Shanghai AI Laboratory
Yali Wang
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shanghai AI Laboratory
Limin Wang
Nanjing University, Shanghai AI Laboratory