MindGPT-4ov: An Enhanced MLLM via a Multi-Stage Post-Training Paradigm

📅 2025-12-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address weak foundational capabilities, poor generalization, and high adaptation costs in post-training multimodal large language models (MLLMs), this paper proposes a universal post-training paradigm. Methodologically, it introduces: (1) information-density-driven generation of high-quality, cross-domain multimodal data; (2) collaborative curriculum-supervised fine-tuning guided by a dual-dimensional hierarchical taxonomy of labels; and (3) a multi-objective hybrid reinforcement learning framework balancing reasoning accuracy, response conciseness, and diversity-aware exploration. Integrated with efficient infrastructure—including 5D parallel training, operator optimization, and inference quantization—the approach enables low-cost domain adaptation. Empirically, it achieves state-of-the-art performance on major benchmarks including MMBench, MMStar, MathVision, and MathVista, and significantly improves user experience in vertical applications. The code and models are publicly released.


📝 Abstract
We present MindGPT-4ov, a multimodal large language model (MLLM) built on a general post-training paradigm spanning data production, model training, and efficient deployment. It achieves state-of-the-art performance across multiple benchmarks at low cost, enhancing both the foundational capabilities and the generalization ability of MLLMs. Focusing on data construction, supervised fine-tuning strategies, and multimodal reinforcement learning, this work proposes three key innovations: (1) an information-density-based data generation scheme, integrated with a dual-dimensional tree-structured label system, that enables automated generation of high-quality cross-domain data; (2) a collaborative curriculum supervised fine-tuning approach that balances the injection of domain-specific knowledge with the preservation of general capabilities; and (3) a hybrid reinforcement learning paradigm that enhances reasoning ability while simultaneously addressing multi-objective optimization targets such as diversity-aware exploration, maintenance of multimodal perception, and response conciseness. Moreover, we implement a series of infrastructure optimizations, including 5D parallel training, operator optimization, and inference quantization, to improve training and inference efficiency while reducing the cost of domain adaptation. Experimental results demonstrate that MindGPT-4ov outperforms state-of-the-art models on benchmarks such as MMBench, MMStar, MathVision, and MathVista. In addition, MindGPT-4ov delivers a superior user experience in vertical-domain tasks, enabling a seamless transition from academic research to industrial deployment. MindGPT-4ov provides a general post-training paradigm applicable to a wide range of MLLMs. The model weights, datasets, and code for the Qwen3-VL-based variants will soon be open-sourced to support the community's development of MLLMs.
Problem

Research questions and friction points this paper is trying to address.

Enhancing multimodal large language models through a comprehensive post-training framework.
Improving model generalization while maintaining specialized knowledge and multimodal perception.
Reducing computational costs for training and deploying advanced multimodal AI systems.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated generation of high-quality, cross-domain data via an information-density-based scheme
Collaborative curriculum supervised fine-tuning that balances domain-specific knowledge and general capabilities
Hybrid reinforcement learning that enhances reasoning under multi-objective optimization
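The multi-objective reinforcement learning idea above (balancing reasoning accuracy, response conciseness, and diversity-aware exploration) can be illustrated with a minimal sketch. The reward terms, weights, and length budget below are illustrative assumptions, not the paper's actual formulation:

```python
# Hypothetical sketch of a multi-objective reward that blends accuracy,
# conciseness, and diversity signals. All terms and weights are
# illustrative assumptions; the paper does not specify its formulation.

def combined_reward(correct: bool, n_tokens: int, novelty: float,
                    w_acc: float = 1.0, w_len: float = 0.1,
                    w_div: float = 0.2, target_len: int = 256) -> float:
    """Return a scalar reward for one sampled response (rollout)."""
    # Reasoning accuracy: 1 if the final answer is verified correct.
    r_acc = 1.0 if correct else 0.0
    # Conciseness: penalize tokens beyond a target budget, scaled to [-inf, 0].
    r_len = -max(0, n_tokens - target_len) / target_len
    # Diversity-aware exploration: reward novel rollouts, novelty in [0, 1]
    # (e.g. 1 minus max similarity to previously sampled responses).
    r_div = novelty
    return w_acc * r_acc + w_len * r_len + w_div * r_div
```

In a PPO- or GRPO-style loop, this scalar would score each sampled response before advantage estimation; the weights trade off objective priorities and would need tuning.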
Wei Chen
MindGPT-4o Team, Li Auto Inc.
Chaoqun Du
Department of Automation, Tsinghua University
Feng Gu
MindGPT-4o Team, Li Auto Inc.
Wei He
MindGPT-4o Team, Li Auto Inc.
Qizhen Li
MindGPT-4o Team, Li Auto Inc.
Zide Liu
Zhejiang University
Xuhao Pan
MindGPT-4o Team, Li Auto Inc.
Chang Ren
MindGPT-4o Team, Li Auto Inc.
Xudong Rao
MindGPT-4o Team, Li Auto Inc.
Chenfeng Wang
MindGPT-4o Team, Li Auto Inc.
Tao Wei
MindGPT-4o Team, Li Auto Inc.
Chengjun Yu
MindGPT-4o Team, Li Auto Inc.
Pengfei Yu
University of Illinois at Urbana-Champaign
Yufei Zheng
MindGPT-4o Team, Li Auto Inc.
Chunpeng Zhou
MindGPT-4o Team, Li Auto Inc.
Pan Zhou
MindGPT-4o Team, Li Auto Inc.
Xuhan Zhu
UCAS