🤖 AI Summary
To address the challenges of deploying multimodal large language models (MLLMs) on edge devices, namely large parameter counts, high inference overhead, and limited reasoning capability, this paper introduces BlueLM-2.5-3B, a compact MLLM with only 2.9 billion parameters. It is the first 3B-scale MLLM to support explicit, controllable "think/non-think" dual-mode inference, enabling dynamic allocation of the reasoning token budget. The model is developed through diversified data construction, resampling of key samples, hybrid heterogeneous reinforcement learning, and an efficient training architecture, achieving strong performance with substantially less training data. Specifically, BlueLM-2.5-3B matches Qwen3-4B on text-only tasks, attains roughly 95% of Kimi-VL-A3B-16B's average multimodal performance, and outperforms Qwen2.5-VL-3B in non-thinking mode. The model thus strikes a favorable trade-off among edge-deployment efficiency, general multimodal understanding, and structured reasoning capability.
📝 Abstract
We present BlueLM-2.5-3B, a compact and unified dense Multimodal Large Language Model (MLLM) designed for efficient edge-device deployment, offering strong general-purpose and reasoning capabilities. To the best of our knowledge, this is the first 3B-scale MLLM to support both thinking and non-thinking modes, while also enabling explicit control over the thinking token budget. BlueLM-2.5-3B is developed through diversified data curation, key-data resampling, hybrid heterogeneous reinforcement learning, and a high-performance training infrastructure. With only 2.9 billion parameters, our model achieves strong multimodal capability while preserving competitive pure-text performance. We conduct comprehensive evaluations across a broad range of multimodal and text-only benchmarks. In thinking mode, BlueLM-2.5-3B achieves performance comparable to Qwen3-4B on text-only benchmarks, and trails the larger Kimi-VL-A3B-16B by only about 5% on average across multimodal evaluations. In non-thinking mode, it outperforms Qwen2.5-VL-3B on the majority of multimodal benchmarks. Additionally, BlueLM-2.5-3B exhibits exceptional data efficiency: all of the above results are achieved with substantially less total training data than that used by Qwen2.5-VL-3B and Qwen3-4B. We hope our work contributes to the advancement of high-performance, on-device MLLMs and provides meaningful insights to the research community.