Towards Self-Improving Systematic Cognition for Next-Generation Foundation MLLMs

📅 2025-03-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Current multimodal large language models (MLLMs) have limited fine-grained perception and complex-reasoning capabilities, largely because high-quality human annotations are scarce and chain-of-thought (CoT) reasoning data is costly to construct; self-generated annotations, in turn, often lack accuracy and completeness.
Method: The authors propose Self-Improving Cognition (SIcog), a framework that introduces "chain-of-description" modeling to strengthen fine-grained visual understanding and combines structured CoT reasoning with self-consistency-based data distillation for closed-loop cognitive self-improvement.
Results: Using only 213K self-generated multimodal samples, without large-scale human annotation, SIcog achieves state-of-the-art performance across multiple benchmarks on both high- and low-resolution MLLMs, outperforming mainstream pre-training methods. This work establishes a new paradigm for the autonomous evolution of foundation MLLMs.
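The closed loop described above (self-generate, curate via self-consistency, then retrain) can be sketched roughly as follows. The `model.generate`/`model.pretrain` interface, the exact-match majority-vote rule, and the 0.5 agreement threshold are illustrative assumptions, not the paper's actual implementation.

```python
from collections import Counter

def self_consistency_filter(candidates, min_agreement=0.5):
    """Keep an output only if a majority of sampled generations agree.

    Illustrative stand-in for SIcog's self-consistency curation:
    exact-match voting here; the paper may use a softer criterion.
    """
    best, count = Counter(candidates).most_common(1)[0]
    return best if count / len(candidates) >= min_agreement else None

def sicog_round(model, images, prompts, n_samples=5):
    """One self-improvement round: generate -> curate -> pre-train."""
    curated = []
    for image, prompt in zip(images, prompts):
        # The seed model samples several candidate captions / CoT answers.
        candidates = [model.generate(image, prompt) for _ in range(n_samples)]
        # Only majority-consistent outputs survive curation.
        kept = self_consistency_filter(candidates)
        if kept is not None:
            curated.append((image, prompt, kept))
    # The curated self-generated data refines the model via pre-training,
    # yielding the "next-generation" foundation model.
    model.pretrain(curated)
    return curated
```

Sampling several candidates and keeping only agreed-upon outputs is what lets the loop avoid amplifying its own annotation errors across rounds.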

📝 Abstract
Despite their impressive capabilities, Multimodal Large Language Models (MLLMs) face challenges with fine-grained perception and complex reasoning. Prevalent pre-training approaches focus on enhancing perception by training on high-quality image captions due to the extremely high cost of collecting chain-of-thought (CoT) reasoning data for improving reasoning. While leveraging advanced MLLMs for caption generation enhances scalability, the outputs often lack comprehensiveness and accuracy. In this paper, we introduce Self-Improving Cognition (SIcog), a self-learning framework designed to construct next-generation foundation MLLMs by enhancing their systematic cognitive capabilities through multimodal pre-training with self-generated data. Specifically, we propose chain-of-description, an approach that improves an MLLM's systematic perception by enabling step-by-step visual understanding, ensuring greater comprehensiveness and accuracy. Additionally, we adopt a structured CoT reasoning technique to enable MLLMs to integrate in-depth multimodal reasoning. To construct a next-generation foundation MLLM with self-improved cognition, SIcog first equips an MLLM with systematic perception and reasoning abilities using minimal external annotations. The enhanced models then generate detailed captions and CoT reasoning data, which are further curated through self-consistency. This curated data is ultimately used to refine the MLLM during multimodal pre-training, facilitating next-generation foundation MLLM construction. Extensive experiments on both low- and high-resolution MLLMs across diverse benchmarks demonstrate that, with merely 213K self-generated pre-training samples, SIcog produces next-generation foundation MLLMs with significantly improved cognition, achieving benchmark-leading performance compared to prevalent pre-training approaches.
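The chain-of-description idea, step-by-step visual understanding that accumulates into a comprehensive caption, could be approximated with staged prompting along these lines; the five stages and the `generate(image, prompt, history=...)` interface are assumptions for illustration, not the paper's prompt design.

```python
# Hypothetical description stages, ordered coarse-to-fine; each stage
# conditions on everything produced so far.
STAGES = [
    "Give a one-sentence overview of the image.",
    "List the salient objects and their attributes.",
    "Describe spatial relations among the objects.",
    "Note fine-grained details (text, counts, textures).",
    "Combine the steps above into one comprehensive caption.",
]

def chain_of_description(model, image):
    """Run the staged prompts and return the final merged caption."""
    context = []
    for prompt in STAGES:
        # Earlier partial descriptions are fed back in as history so each
        # step can build on (and correct) the ones before it.
        out = model.generate(image, prompt, history=context)
        context.append(out)
    return context[-1]
```

The point of the staging is completeness: a single-pass caption tends to skip fine detail, whereas forcing the model through explicit coarse-to-fine steps surfaces content that then gets merged into the final description.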
Problem

Research questions and friction points this paper is trying to address.

Enhance fine-grained perception and complex reasoning in MLLMs
Reduce reliance on costly chain-of-thought reasoning data
Improve comprehensiveness and accuracy of self-generated captions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-Improving Cognition (SIcog) framework
Chain-of-description for step-by-step perception
Structured CoT reasoning for multimodal integration