🤖 AI Summary
Multimodal large language models (MLLMs) suffer from low training and inference efficiency, particularly at scale. To address this, we propose an efficient MLLM framework targeting the 8B-parameter regime. Our method introduces a unified 3D-Resampler architecture for compact joint encoding of images and videos; a lightweight, integrated multi-task learning paradigm for document understanding and text recognition that eliminates complex data engineering; and a hybrid reinforcement learning strategy that jointly optimizes short- and long-horizon reasoning. Leveraging high-quality image-text-video data alongside model compression techniques, our approach achieves state-of-the-art performance on VideoMME with only 46.7% of the GPU memory consumption and 8.7% of the inference latency of comparable baselines, while surpassing GPT-4o-latest and Qwen2.5-VL-72B on OpenCompass.
📝 Abstract
Multimodal Large Language Models (MLLMs) are undergoing rapid progress and represent the frontier of AI development. However, their training and inference efficiency have emerged as a core bottleneck to making MLLMs more accessible and scalable. To address these challenges, we present MiniCPM-V 4.5, an 8B-parameter model designed for high efficiency and strong performance. We introduce three core improvements in model architecture, data strategy, and training method: a unified 3D-Resampler architecture for highly compact encoding of images and videos, a unified learning paradigm for document knowledge and text recognition that requires no heavy data engineering, and a hybrid reinforcement learning strategy that yields proficiency in both short and long reasoning modes. Comprehensive experimental results on the OpenCompass evaluation show that MiniCPM-V 4.5 surpasses widely used proprietary models such as GPT-4o-latest, as well as significantly larger open-source models such as Qwen2.5-VL 72B. Notably, this strong performance is achieved with remarkable efficiency. For example, on the widely adopted VideoMME benchmark, MiniCPM-V 4.5 achieves state-of-the-art performance among models under 30B parameters, using just 46.7% of the GPU memory cost and 8.7% of the inference time of Qwen2.5-VL 7B.