Kwai Keye-VL 1.5 Technical Report

📅 2025-09-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Video understanding faces dual challenges of dynamic content and high information density, where existing models struggle to balance spatial resolution with temporal coverage. To address this, we propose a Slow-Fast video encoding strategy that dynamically allocates computational resources between key frames and static frames. We design a four-stage progressive pretraining paradigm, extending context length to 128K tokens. Furthermore, we introduce a comprehensive post-training framework integrating inference enhancement, human preference alignment, and progressive prompting, augmented by a five-step chain-of-thought data construction pipeline and GSPO-driven iterative reinforcement learning. Our method achieves significant improvements over state-of-the-art models on public video understanding benchmarks and internal human evaluations, while maintaining leading performance across general multimodal tasks.

📝 Abstract
In recent years, the development of Large Language Models (LLMs) has significantly advanced, extending their capabilities to multimodal tasks through Multimodal Large Language Models (MLLMs). However, video understanding remains a challenging area due to the dynamic and information-dense nature of videos. Existing models struggle with the trade-off between spatial resolution and temporal coverage when processing video content. We present Keye-VL-1.5, which addresses fundamental challenges in video comprehension through three key innovations. First, we introduce a novel Slow-Fast video encoding strategy that dynamically allocates computational resources based on inter-frame similarity, processing key frames with significant visual changes at higher resolution (Slow pathway) while handling relatively static frames with increased temporal coverage at lower resolution (Fast pathway). Second, we implement a progressive four-stage pre-training methodology that systematically extends the model's context length from 8K to 128K tokens, enabling processing of longer videos and more complex visual content. Third, we develop a comprehensive post-training pipeline focusing on reasoning enhancement and human preference alignment, incorporating a 5-step chain-of-thought data construction process, iterative GSPO-based reinforcement learning with progressive prompt hinting for difficult cases, and alignment training. Through extensive evaluation on public benchmarks and rigorous internal human assessment, Keye-VL-1.5 demonstrates significant improvements over existing models, particularly excelling in video understanding tasks while maintaining competitive performance on general multimodal benchmarks.
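The Slow-Fast routing described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the cosine-similarity criterion, the threshold value, and the function name are assumptions standing in for whatever inter-frame similarity measure Keye-VL-1.5 actually uses.

```python
import numpy as np

def route_frames(frames, sim_threshold=0.9):
    """Route video frames to Slow/Fast pathways by inter-frame similarity.

    frames: list of H x W x 3 uint8 arrays.
    Frames that differ strongly from the last key frame go to the Slow
    pathway (high spatial resolution); near-static frames go to the Fast
    pathway (lower resolution, denser temporal coverage). The cosine
    similarity and the 0.9 threshold are illustrative assumptions.
    """
    slow, fast = [], []
    prev = None
    for idx, frame in enumerate(frames):
        vec = frame.astype(np.float32).ravel()
        if prev is None:
            slow.append(idx)  # first frame is always a key frame
            prev = vec
            continue
        # cosine similarity to the most recent key frame
        sim = float(vec @ prev) / (np.linalg.norm(vec) * np.linalg.norm(prev) + 1e-8)
        if sim < sim_threshold:  # significant visual change -> Slow pathway
            slow.append(idx)
            prev = vec
        else:                    # near-static -> cheap Fast pathway
            fast.append(idx)
    return slow, fast
```

With a routing like this, the token budget concentrates on the few frames that carry new visual information, which is what lets the model trade spatial resolution against temporal coverage per frame rather than globally.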
Problem

Research questions and friction points this paper is trying to address.

Addressing video understanding challenges in multimodal language models
Balancing spatial resolution and temporal coverage trade-offs
Enhancing long-context video processing and human alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Slow-Fast video encoding strategy for dynamic resource allocation
Progressive four-stage pre-training for extended context length
Comprehensive post-training pipeline for reasoning enhancement
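The GSPO-based reinforcement learning named above optimizes a sequence-level clipped objective. A hedged sketch, assuming a GRPO-style group-normalized advantage and a length-normalized sequence importance ratio; the exact normalization, clipping range, and function signature here are assumptions, not the report's implementation:

```python
import numpy as np

def gspo_loss(logp_new, logp_old, rewards, eps=0.2):
    """GSPO-style clipped policy objective over a group of sampled responses.

    logp_new / logp_old: per-token log-probs of each response under the
    current and behavior policies (lists of 1-D arrays).
    rewards: one scalar reward per response in the group.
    All hyperparameters are illustrative assumptions.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    # group-normalized advantages (GRPO-style baseline)
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    losses = []
    for new, old, a in zip(logp_new, logp_old, adv):
        # sequence-level, length-normalized importance ratio
        s = np.exp((np.sum(new) - np.sum(old)) / len(new))
        clipped = np.clip(s, 1.0 - eps, 1.0 + eps)
        losses.append(-min(s * a, clipped * a))
    return float(np.mean(losses))
```

Clipping at the sequence level, rather than per token, keeps the update stable for long responses, which matters when the progressive prompt hinting pushes the model toward lengthy chains of thought on difficult cases.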
Biao Yang
Shanghai Jiao Tong University, Antai College of Economics and Management
Asset Pricing, Climate Finance
Bin Wen
Kuaishou
MLLM
Boyang Ding
Kuaishou Group
Changyi Liu
Kuaishou Group
Chenglong Chu
Kuaishou Group
Chengru Song
Unknown affiliation
Chongling Rao
Kuaishou Group
Chuan Yi
Kuaishou Group
Da Li
Kuaishou Group
Dunju Zang
Kuaishou Group
Fan Yang
Kuaishou Group
Guorui Zhou
Unknown affiliation
Recommender System, Advertising, Artificial Intelligence, Machine Learning, NLP
Guowang Zhang
Kuaishou Group
Han Shen
Research Engineer, Ant Group; Ph.D., Rensselaer Polytechnic Institute
Optimization, Reinforcement Learning, Alignment
Hao Peng
Kuaishou Group
Haojie Ding
Kuaishou Group
Hao Wang
Kuaishou Group
Hengrui Ju
Kuaishou Group
Jiaming Huang
Kuaishou Group
Jiangxia Cao
Kuaishou Tech
RecSys, Low-Resource Large Model
Jiankang Chen
Kuaishou Group
Jingyun Hua
Kuaishou
Natural Language Processing, Large Language Model
Kaibing Chen
Kuaishou Group
Kaiyu Jiang
Kuaishou
MLLM
Kaiyu Tang
Kuaishou Group