Kimi-VL Technical Report

📅 2025-04-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work introduces Kimi-VL, the first efficient open-source Mixture-of-Experts (MoE) vision-language model, designed to overcome efficiency and performance bottlenecks in multimodal large models for long-context understanding, complex reasoning, and agent capabilities. Methodologically, it (1) proposes MoonViT, a native-resolution visual encoder enabling fine-grained image and video perception; (2) builds an MoE language decoder that activates only 2.8B parameters; and (3) introduces Kimi-VL-Thinking, a long chain-of-thought (CoT) variant trained with CoT supervised fine-tuning and reinforcement learning to optimize multi-step inference. Experiments show that Kimi-VL achieves state-of-the-art results on long-context and OCR benchmarks, including LongVideoBench (64.5), MMLongBench-Doc (35.1), and InfoVQA (83.2). Kimi-VL-Thinking further surpasses closed-source models such as GPT-4o on MMMU (61.7) and MathVista (71.3), demonstrating its strength in college-level multimodal reasoning and agent-oriented tasks.
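The efficiency claim rests on sparse activation: each token is routed to only a few experts, so the activated parameter count (2.8B) is far below the total. Below is a minimal, illustrative sketch of top-k expert routing in PyTorch; the hidden sizes, expert count, and top_k are placeholders, not Kimi-VL's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Top-k routed mixture-of-experts feed-forward layer (illustrative sizes)."""

    def __init__(self, d_model=1024, d_ff=2816, n_experts=64, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                      # x: (tokens, d_model)
        logits = self.router(x)                # (tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # normalize over the chosen experts only
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in idx[:, k].unique():       # dispatch each token to its k-th expert
                mask = idx[:, k] == e
                out[mask] += weights[mask, k:k + 1] * self.experts[e](x[mask])
        return out
```

Because only top_k experts run per token, compute scales with the activated parameters rather than the full expert pool, which is how a model with a large total parameter count can serve at the cost of a small dense one.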

📝 Abstract
We present Kimi-VL, an efficient open-source Mixture-of-Experts (MoE) vision-language model (VLM) that offers advanced multimodal reasoning, long-context understanding, and strong agent capabilities, all while activating only 2.8B parameters in its language decoder (Kimi-VL-A3B). Kimi-VL demonstrates strong performance across challenging domains: as a general-purpose VLM, Kimi-VL excels in multi-turn agent tasks (e.g., OSWorld), matching flagship models. Furthermore, it exhibits remarkable capabilities across diverse challenging vision-language tasks, including college-level image and video comprehension, OCR, mathematical reasoning, and multi-image understanding. In comparative evaluations, it effectively competes with cutting-edge efficient VLMs such as GPT-4o-mini, Qwen2.5-VL-7B, and Gemma-3-12B-IT, while surpassing GPT-4o in several key domains. Kimi-VL also pushes forward in processing long contexts and perceiving clearly. With a 128K extended context window, Kimi-VL can process diverse long inputs, achieving impressive scores of 64.5 on LongVideoBench and 35.1 on MMLongBench-Doc. Its native-resolution vision encoder, MoonViT, further allows it to see and understand ultra-high-resolution visual inputs, achieving 83.2 on InfoVQA and 34.5 on ScreenSpot-Pro, while maintaining lower computational cost for common tasks. Building upon Kimi-VL, we introduce an advanced long-thinking variant: Kimi-VL-Thinking. Developed through long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL), this model exhibits strong long-horizon reasoning capabilities. It achieves scores of 61.7 on MMMU, 36.8 on MathVision, and 71.3 on MathVista while maintaining its compact 2.8B activated LLM parameters, setting a new standard for efficient multimodal thinking models. Code and models are publicly accessible at https://github.com/MoonshotAI/Kimi-VL.
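Since the abstract points to a public release, here is a hedged inference sketch with Hugging Face transformers. The checkpoint id moonshotai/Kimi-VL-A3B-Instruct, the chat-message layout, and the processor calls are assumptions inferred from the linked repository; see https://github.com/MoonshotAI/Kimi-VL for the authoritative instructions.

```python
# Hedged usage sketch; checkpoint id and message format are assumptions,
# not confirmed by this page. Consult the GitHub repo for the exact recipe.
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

path = "moonshotai/Kimi-VL-A3B-Instruct"  # assumed Hugging Face checkpoint id
model = AutoModelForCausalLM.from_pretrained(
    path, torch_dtype="auto", device_map="auto", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(path, trust_remote_code=True)

image = Image.open("demo.png")  # any local screenshot or document image
messages = [{"role": "user",
             "content": [{"type": "image", "image": "demo.png"},
                         {"type": "text", "text": "Describe this screenshot."}]}]
text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=text, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens, not the echoed prompt.
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:],
                             skip_special_tokens=True)[0])
```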
Problem

Research questions and friction points this paper is trying to address.

Develops an efficient, open-source vision-language model for multimodal reasoning
Enables long-context understanding with a 128K extended context window
Advances high-resolution visual input processing with the MoonViT encoder
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts vision-language model
128K extended context window
Native-resolution vision encoder MoonViT
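The MoonViT item above refers to native-resolution encoding: instead of resizing every image to a fixed square, variably sized images are split into patches and packed into one token sequence, in the spirit of NaViT-style packing. The sketch below illustrates that general idea only; the patch size, cropping rule, and packing details are illustrative assumptions, not MoonViT's actual implementation.

```python
import torch

PATCH = 14  # illustrative patch size, not MoonViT's actual value

def patchify_native(img: torch.Tensor) -> torch.Tensor:
    """Split a (C, H, W) image into flattened patches without resizing.

    H and W are only cropped to the nearest multiple of PATCH, so the
    aspect ratio and fine detail of high-resolution inputs are preserved.
    """
    c, h, w = img.shape
    h, w = h - h % PATCH, w - w % PATCH
    patches = (img[:, :h, :w]
               .unfold(1, PATCH, PATCH)   # (C, H/P, W, P): rows of patches
               .unfold(2, PATCH, PATCH))  # (C, H/P, W/P, P, P)
    return patches.permute(1, 2, 0, 3, 4).reshape(-1, c * PATCH * PATCH)

# Pack images of different resolutions into one variable-length sequence;
# a real encoder would also add 2-D positions and per-image attention masks.
imgs = [torch.randn(3, 896, 1344), torch.randn(3, 336, 448)]
seq = torch.cat([patchify_native(im) for im in imgs])
print(seq.shape)  # higher-resolution images contribute more tokens
```

The payoff is that a small thumbnail costs few tokens while an ultra-high-resolution screenshot keeps its detail, matching the abstract's claim of strong ScreenSpot-Pro/InfoVQA results at lower cost on common inputs.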
👥 Authors (Kimi Team)
Angang Du
Bohong Yin
Bowei Xing
Bowen Qu · Peking University, formerly Rhymes.ai Aria Team · Multimodal Learning, Vision-Language Models, Computer Vision
Bowen Wang
Cheng Chen
Chenlin Zhang
Chenzhuang Du
Chu Wei
Congcong Wang
Dehao Zhang · University of Electronic Science and Technology of China · Spiking Neural Network
Dikang Du
Dongliang Wang · Moonshot AI · Video Understanding and Generation, 3D Vision, Motion Synthesis, Machine Learning
Enming Yuan
Enzhe Lu
Fang Li
Flood Sung · Moonshot AI · Foundation Models, LLM/VLM, Agent, Reinforcement Learning, Meta Learning
Guangda Wei
Guokun Lai · Inflection AI · Machine Learning
Han Zhu
Hao Ding
Hao Hu
Hao Yang
Hao Zhang
Haoning Wu · Shanghai Jiao Tong University · Computer Vision, Multi-modal Learning, Generative Models
Haotian Yao
Haoyu Lu
Heng Wang
Hongcheng Gao · University of Chinese Academy of Sciences · Natural Language Processing, Large Language Models, Vision Language Models
Huabin Zheng
Jiaming Li
Jianlin Su · Moonshot AI
Jianzhou Wang
Jiaqi Deng · The University of Hong Kong · Deep Learning, Natural Language Processing
Jiezhong Qiu · Zhejiang University, Zhejiang Lab Hundred Talents Program Researcher · Data Mining, Social Network Analysis, Network Embedding, Graph Neural Networks
Jin Xie
Jinhong Wang
Jingyuan Liu
Junjie Yan
Kun Ouyang · National University of Singapore · Human Mobility, Machine Learning
Liang Chen
Lin Sui · Moonshot AI Ltd · Computer Vision
Longhui Yu · Kimi & University of Toronto & Peking University · AI Alignment, Large Language Model, AI4Math, Trustworthy AI, Continual Learning
Mengfan Dong
Mengnan Dong
Nuo Xu
Pengyu Cheng · Alibaba Group · Machine Learning, Natural Language Processing
Qizheng Gu
Runjie Zhou
Shaowei Liu · University of Illinois Urbana-Champaign · Computer Vision, Robotics
Sihan Cao
Tao Yu
Tianhui Song · Nanjing University · Computer Vision
Tongtong Bai
Wei Song
Weiran He
Weixiao Huang
Weixin Xu
Xiaokun Yuan
Xingcheng Yao · Moonshot AI
Xingzhe Wu
Xinxing Zu
Xinyu Zhou
Xinyuan Wang
Y. Charles
Yan Zhong
Yang Li
Yangyang Hu
Yanru Chen
Yejie Wang · Beijing University of Posts and Telecommunications · Natural Language Processing
Yibo Liu
Yibo Miao · Shanghai Jiao Tong University; Moonshot · Deep Learning, Natural Language Processing, Large Language Models
Yidao Qin
Yimin Chen · City University of Hong Kong · Medical Imaging, Computer Vision
Yiping Bao
Yiqin Wang
Yongsheng Kang
Yuanxin Liu · Peking University · Natural Language Processing
Yulun Du · Carnegie Mellon University · Deep Learning, Natural Language Processing, Human-AI Interaction
Yuxin Wu
Yuzhi Wang · Research Engineer, Megvii Inc. · Computer Vision, Artificial Intelligence, Wireless Sensor Network
Yuzi Yan · Tsinghua University, Moonshot AI · Robustness in RL, LLM, mLLM, Robotics
Zaida Zhou
Zhaowei Li · Moonshot AI · Computer Vision, Natural Language Processing
Zhejun Jiang
Zheng Zhang
Zhilin Yang · Carnegie Mellon University · Deep Learning, Machine Learning, Natural Language Processing
Zhiqi Huang
Zihao Huang
Zijia Zhao · Institute of Automation, Chinese Academy of Sciences (CASIA) · Multimodal Learning
Ziwei Chen · The Hong Kong Polytechnic University · Computer Graphics, Creative Media