🤖 AI Summary
Multimodal large language models (MLLMs) handle static images well but exhibit limited capability in understanding information-dense short videos, and existing approaches over-rely on static image data, neglecting temporal dynamics. Method: We propose Kwai Keye-VL, an 8B-parameter multimodal foundation model trained with a hierarchical pipeline of four-stage pre-training and two-phase post-training. The second post-training phase introduces a five-mode "cold-start" data mixture ("thinking", "non-thinking", "auto-think", "think with image", and high-quality video data), and the pipeline as a whole combines large-scale video data, instruction tuning, chain-of-thought reasoning, reinforcement learning, and behavior-correction alignment, teaching the model to decide autonomously *when* and *how* to reason. Contribution/Results: Keye-VL achieves state-of-the-art results on public video understanding benchmarks, remains highly competitive on general vision-language tasks, and shows a significant advantage on KC-MMBench, a newly released benchmark for real-world short-video scenarios.
📝 Abstract
While Multimodal Large Language Models (MLLMs) demonstrate remarkable capabilities on static images, they often fall short in comprehending dynamic, information-dense short-form videos, a dominant medium in today's digital landscape. To bridge this gap, we introduce **Kwai Keye-VL**, an 8-billion-parameter multimodal foundation model engineered for leading-edge performance in short-video understanding while maintaining robust general-purpose vision-language abilities. The development of Keye-VL rests on two core pillars: a massive, high-quality dataset exceeding 600 billion tokens with a strong emphasis on video, and an innovative training recipe. This recipe features a four-stage pre-training process for solid vision-language alignment, followed by a meticulous two-phase post-training process. The first post-training stage enhances foundational capabilities like instruction following, while the second phase focuses on stimulating advanced reasoning. In this second phase, a key innovation is our five-mode "cold-start" data mixture, which includes "thinking", "non-thinking", "auto-think", "think with image", and high-quality video data. This mixture teaches the model to decide when and how to reason. Subsequent reinforcement learning (RL) and alignment steps further enhance these reasoning capabilities and correct abnormal model behaviors, such as repetitive outputs. To validate our approach, we conduct extensive evaluations, showing that Keye-VL achieves state-of-the-art results on public video benchmarks and remains highly competitive on general image-based tasks (Figure 1). Furthermore, we develop and release **KC-MMBench**, a new benchmark tailored for real-world short-video scenarios, where Keye-VL shows a significant advantage.
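The five-mode "cold-start" mixture is, at its core, a data-mixing configuration over reasoning styles. As a rough illustration only (the paper does not specify its mixture ratios, mode tags, or sampling logic in this section; the weights, tag names, and `sample_batch` helper below are hypothetical), a weighted sampler over the five modes might look like this:

```python
import random

# Hypothetical mixture weights over the five cold-start modes named in the
# abstract; the paper's actual ratios and tagging scheme are not given here.
COLD_START_MODES = {
    "thinking": 0.30,          # explicit chain-of-thought before the answer
    "non_thinking": 0.25,      # direct answers with no reasoning trace
    "auto_think": 0.20,        # model chooses whether to emit reasoning
    "think_with_image": 0.10,  # reasoning interleaved with visual evidence
    "video": 0.15,             # high-quality video instruction data
}

def sample_batch(pools: dict[str, list[dict]], batch_size: int) -> list[dict]:
    """Draw a cold-start batch by first sampling a mode, then an example.

    `pools` maps each mode name to its list of training examples
    (dicts with at least "prompt" and "response" fields).
    """
    modes = list(COLD_START_MODES)
    weights = [COLD_START_MODES[m] for m in modes]
    batch = []
    for mode in random.choices(modes, weights=weights, k=batch_size):
        example = random.choice(pools[mode])
        # Tag each example with its mode so the model can associate the
        # mode with the expected output style (reason first vs. answer).
        batch.append({**example, "mode": mode})
    return batch
```

In this reading, the "auto-think" data is what teaches the *when* (emit a reasoning trace only for queries that warrant it), while "think with image" targets the *how* (grounding intermediate steps in visual evidence).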