VCM: Vision Concept Modeling Based on Implicit Contrastive Learning with Vision-Language Instruction Fine-Tuning

📅 2025-04-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large vision-language models (LVLMs) process images at the token level, resulting in low computational efficiency and a lack of human-like, concept-level understanding. To address this, we propose the first end-to-end self-supervised visual concept modeling framework that jointly integrates implicit contrastive learning with vision-language instruction tuning, requiring no concept-level annotations, to learn interpretable and transferable visual concept representations. Our method introduces a multi-instance sampling contrastive mechanism and concept-aware optimization of the visual encoder. Evaluated on LLaVA-1.5-7B, it reduces FLOPs by 85% while preserving multi-task image understanding performance and significantly improving visual concept recognition accuracy. This work breaks the prevailing LVLM paradigm of relying solely on pixel- or patch-level modeling, establishing a novel pathway toward efficient, interpretable, and semantically grounded vision-language understanding.
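To see why fewer visual tokens translate into such large FLOPs savings, note that self-attention cost grows quadratically with sequence length. The sketch below is purely illustrative and not from the paper: the token counts (576 patch tokens for LLaVA-1.5's 336x336 input, a hypothetical 3x reduction), the text-token count, and the hidden dimension are all assumptions, and the FLOPs formula counts only the two attention matmuls.

```python
import numpy as np

# Hypothetical setup: LLaVA-1.5 encodes a 336x336 image into 576 patch
# tokens (24x24 grid). A concept-level representation would keep far fewer.
PATCH_TOKENS = 576

def attention_flops(num_visual_tokens, num_text_tokens=64, dim=4096):
    """Rough FLOPs for one self-attention pass over the combined sequence.

    Counts only the QK^T and attn@V matmuls (projections omitted); each is
    an (n x n x d) matmul at 2 FLOPs per multiply-accumulate.
    """
    n = num_visual_tokens + num_text_tokens
    return 2 * (2 * n * n * dim)

full = attention_flops(PATCH_TOKENS)
# Keeping roughly a third of the visual tokens cuts attention cost
# super-linearly, because cost is quadratic in sequence length.
reduced = attention_flops(PATCH_TOKENS // 3)
savings = 1.0 - reduced / full
print(f"attention FLOPs savings: {savings:.0%}")
```

Under these toy numbers the attention savings already exceed 80%, which is consistent in spirit with (though not a derivation of) the 85% end-to-end figure reported for LLaVA-1.5-7B.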

📝 Abstract
Large Vision-Language Models (LVLMs) are pivotal for real-world AI tasks like embodied intelligence due to their strong vision-language reasoning abilities. However, current LVLMs process entire images at the token level, which is inefficient compared to humans who analyze information and generate content at the conceptual level, extracting relevant visual concepts with minimal effort. This inefficiency, stemming from the lack of a visual concept model, limits LVLMs' usability in real-world applications. To address this, we propose VCM, an end-to-end self-supervised visual concept modeling framework. VCM leverages implicit contrastive learning across multiple sampled instances and vision-language fine-tuning to construct a visual concept model without requiring costly concept-level annotations. Our results show that VCM significantly reduces computational costs (e.g., 85% fewer FLOPs for LLaVA-1.5-7B) while maintaining strong performance across diverse image understanding tasks. Moreover, VCM enhances visual encoders' capabilities in classic visual concept perception tasks. Extensive quantitative and qualitative experiments validate the effectiveness and efficiency of VCM.
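The abstract's "implicit contrastive learning across multiple sampled instances" is not spelled out here, but the general family of objectives it belongs to can be sketched with a standard InfoNCE-style loss: embeddings of sampled views of the same instance are pulled together while other instances in the batch serve as negatives. The function below is a generic NumPy sketch of that family, not the paper's actual loss; the temperature value and the one-positive-per-row batch layout are assumptions.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.07):
    """InfoNCE-style contrastive loss over a batch of embedding pairs.

    anchors, positives: (B, D) arrays; row i of `positives` is the positive
    view for row i of `anchors`, and all other rows act as negatives.
    """
    # L2-normalize so the dot product is cosine similarity
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature          # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Diagonal entries are the matched (positive) pairs
    return -np.mean(np.diag(log_probs))
```

As a sanity check, perfectly matched pairs yield a near-zero loss while mismatched pairs yield a large one, which is the gradient signal such an objective would use to shape concept representations.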
Problem

Research questions and friction points this paper is trying to address.

LVLMs inefficiently process entire images at the token level
Lack of a visual concept model limits LVLMs' usability
High inference cost hinders deployment in real-world applications like embodied intelligence
Innovation

Methods, ideas, or system contributions that make the work stand out.

Implicit contrastive learning for visual concepts
Vision-language instruction fine-tuning framework
Self-supervised end-to-end concept modeling