🤖 AI Summary
This work introduces Kimi-VL, presented as the first efficient open-source Mixture-of-Experts (MoE) vision-language model, designed to overcome efficiency and performance bottlenecks that multimodal large models face in long-context understanding, complex reasoning, and agent tasks. Methodologically: (1) it proposes MoonViT, a native-resolution visual encoder enabling fine-grained image and video perception; (2) it builds an MoE language decoder that activates only 2.8B parameters per token; and (3) it introduces Kimi-VL-Thinking, a long-chain reasoning variant that combines chain-of-thought (CoT) supervised fine-tuning with reinforcement learning to optimize multi-step reasoning. Experiments show that Kimi-VL achieves state-of-the-art results on long-context and OCR benchmarks, including LongVideoBench (64.5), MMLongBench-Doc (35.1), and InfoVQA (83.2). Kimi-VL-Thinking further surpasses closed-source models such as GPT-4o on MMMU (61.7) and MathVista (71.3), validating its strength in college-level multimodal reasoning and agent-oriented tasks.
📝 Abstract
We present Kimi-VL, an efficient open-source Mixture-of-Experts (MoE) vision-language model (VLM) that offers advanced multimodal reasoning, long-context understanding, and strong agent capabilities - all while activating only 2.8B parameters in its language decoder (Kimi-VL-A3B). Kimi-VL demonstrates strong performance across challenging domains: as a general-purpose VLM, it excels in multi-turn agent tasks (e.g., OSWorld), matching flagship models. It also exhibits remarkable capabilities across diverse challenging vision-language tasks, including college-level image and video comprehension, OCR, mathematical reasoning, and multi-image understanding. In comparative evaluations, it competes effectively with cutting-edge efficient VLMs such as GPT-4o-mini, Qwen2.5-VL-7B, and Gemma-3-12B-IT, while surpassing GPT-4o in several key domains. Kimi-VL also advances long-context processing and fine-grained perception. With a 128K extended context window, Kimi-VL can process diverse long inputs, achieving impressive scores of 64.5 on LongVideoBench and 35.1 on MMLongBench-Doc. Its native-resolution vision encoder, MoonViT, further allows it to see and understand ultra-high-resolution visual inputs, achieving 83.2 on InfoVQA and 34.5 on ScreenSpot-Pro, while maintaining lower computational cost for common tasks. Building upon Kimi-VL, we introduce an advanced long-thinking variant: Kimi-VL-Thinking. Developed through long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL), this model exhibits strong long-horizon reasoning capabilities. It achieves scores of 61.7 on MMMU, 36.8 on MathVision, and 71.3 on MathVista while maintaining the same compact 2.8B activated LLM parameters, setting a new standard for efficient multimodal thinking models. Code and models are publicly accessible at https://github.com/MoonshotAI/Kimi-VL.
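The "2.8B activated parameters" figure reflects the core MoE idea: a router sends each token to only its top-k experts, so the parameters touched per token are a small fraction of the total. Below is a minimal, purely illustrative sketch of top-k expert routing in NumPy; all sizes and names are made up for demonstration and are not Kimi-VL's actual architecture or code.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class ToyMoELayer:
    """A toy MoE feed-forward layer with top-k gating (illustrative only)."""
    def __init__(self, d_model=8, d_ff=16, n_experts=4, top_k=2, seed=0):
        rng = np.random.default_rng(seed)
        self.top_k = top_k
        self.gate = rng.normal(size=(d_model, n_experts))        # router weights
        self.w_in = rng.normal(size=(n_experts, d_model, d_ff))  # expert up-projections
        self.w_out = rng.normal(size=(n_experts, d_ff, d_model)) # expert down-projections

    def forward(self, x):
        # Route each token to its top-k experts; only those experts run,
        # so activated parameters per token << total parameters.
        scores = softmax(x @ self.gate)                    # (tokens, n_experts)
        top = np.argsort(scores, axis=-1)[:, -self.top_k:]
        out = np.zeros_like(x)
        for t in range(x.shape[0]):
            weights = scores[t, top[t]]
            weights = weights / weights.sum()              # renormalize over top-k
            for w, e in zip(weights, top[t]):
                h = np.maximum(x[t] @ self.w_in[e], 0.0)   # ReLU expert FFN
                out[t] += w * (h @ self.w_out[e])
        return out, top

layer = ToyMoELayer()
x = np.random.default_rng(1).normal(size=(3, 8))  # 3 tokens of width 8
y, chosen = layer.forward(x)
print(y.shape, chosen.shape)  # (3, 8) (3, 2): each token used 2 of 4 experts
```

With 2 of 4 experts active per token, only the router plus half of the expert weights participate in any one forward step; scaled up, this is how an MoE decoder keeps a small activated-parameter count while its total capacity is much larger.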