🤖 AI Summary
To address severe feature degradation, high computational overhead, and information loss from aggressive compression in long-video understanding, this paper proposes Hierarchical visual token Compression (HiCo), enabling high-fidelity temporal modeling from the clip level to the video level. The authors design a short-to-long multi-stage training paradigm, introduce LongVid, a large-scale dataset of real-world long videos, and upgrade the "Needle-In-A-video-Haystack" (NIAH) task for rigorous evaluation of context capacity. Built upon a 7B-parameter multimodal large language model, the resulting system, VideoChat-Flash, supports end-to-end understanding of videos comprising up to 10,000 frames. On NIAH it reaches 99.1% accuracy over 10,000 frames, the first open-source model at this scale to do so, and it attains state-of-the-art performance on both mainstream long- and short-video benchmarks. All code, models, and datasets are publicly released.
📝 Abstract
Long-context modeling is a critical capability for multimodal large language models (MLLMs), enabling them to process long-form content with implicit memorization. Despite recent advances, handling extremely long videos remains challenging due to the difficulty of maintaining crucial features over extended sequences. This paper introduces a Hierarchical visual token Compression (HiCo) method designed for high-fidelity representation, and a practical context modeling system, VideoChat-Flash, tailored for multimodal long-sequence processing. HiCo capitalizes on the redundancy of visual information in long videos to compress the long video context from the clip level to the video level, significantly reducing computation while preserving essential details. VideoChat-Flash features a multi-stage short-to-long learning scheme, a rich dataset of real-world long videos named LongVid, and an upgraded "Needle-In-A-video-Haystack" (NIAH) task for evaluating context capacities. In extensive experiments, VideoChat-Flash shows leading performance on both mainstream long and short video benchmarks at the 7B model scale. It is the first open-source model to achieve 99.1% accuracy over 10,000 frames in NIAH.
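The clip-to-video compression described above can be pictured as a two-stage pooling pipeline: first reduce tokens within each clip (exploiting local redundancy), then reduce again across the whole video. The sketch below is a minimal illustration assuming simple average pooling at each level; the function names (`hierarchical_compress`, `avg_pool`) and the ratios `clip_ratio`/`video_ratio` are illustrative assumptions, not the paper's actual HiCo operators.

```python
# Minimal sketch of hierarchical two-stage visual-token compression.
# Assumption: a plain average-pooling compressor at each level; the real
# HiCo method and its compression ratios are not specified here.
from typing import List

Vec = List[float]  # a visual token as a feature vector

def avg_pool(tokens: List[Vec], ratio: int) -> List[Vec]:
    """Merge every `ratio` consecutive tokens into their element-wise mean."""
    pooled = []
    for i in range(0, len(tokens), ratio):
        group = tokens[i:i + ratio]
        dim = len(group[0])
        pooled.append([sum(t[d] for t in group) / len(group) for d in range(dim)])
    return pooled

def hierarchical_compress(clips: List[List[Vec]],
                          clip_ratio: int = 4,
                          video_ratio: int = 2) -> List[Vec]:
    # Stage 1: clip-level compression, applied independently per clip.
    clip_tokens = [tok for clip in clips for tok in avg_pool(clip, clip_ratio)]
    # Stage 2: video-level compression across the concatenated sequence.
    return avg_pool(clip_tokens, video_ratio)

# Example: 100 clips x 64 tokens = 6400 tokens -> 6400 / 4 / 2 = 800 tokens.
clips = [[[float(i)] for i in range(64)] for _ in range(100)]
print(len(hierarchical_compress(clips)))  # 800
```

The point of the two stages is that most redundancy in long videos is local (adjacent frames within a clip), so aggressive reduction can happen early and cheaply, while the second pass trims what remains at the sequence level before the tokens reach the language model.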