🤖 AI Summary
To address the challenges of variable-resolution input adaptation and insufficient vision-language representation alignment in Vision Foundation Models (VFMs), this paper proposes CoMP, a continual multimodal pre-training pipeline, with CoMP-AIMv2 as its strongest instantiation. CoMP introduces a Continual Rotary Position Embedding to enable seamless multi-scale image input support; designs a cross-modal alignment loss that pulls visual representations toward the language representation space during continual pre-training of the backbone; and establishes a unified multimodal pre-training paradigm that requires no task-specific fine-tuning. Evaluated on ChartQA (64.9 with a 0.5B LLM), ImageNet-1K (87.3% top-1 accuracy), and ADE20K (51.8 mIoU), CoMP-AIMv2 achieves substantial improvements in both multimodal understanding and unimodal perception tasks. These results validate its capability to learn general-purpose visual representations and demonstrate the effectiveness of its efficient, alignment-driven design.
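The cross-modal alignment idea can be illustrated with a generic CLIP-style symmetric contrastive loss over cosine similarities. This is a minimal sketch under assumed shapes and a hypothetical temperature; the paper's actual Alignment Loss formulation is not reproduced here.

```python
import numpy as np

def alignment_loss(v, t, tau=0.07):
    """Symmetric InfoNCE-style loss aligning visual and textual features.

    Generic sketch of cross-modal alignment (CLIP-style); the paper's
    exact loss may differ. v, t: (n, d) batches where row i of v and
    row i of t form a matched image-text pair. tau is an assumed
    temperature hyperparameter.
    """
    # L2-normalize so the dot product is cosine similarity
    v = v / np.linalg.norm(v, axis=-1, keepdims=True)
    t = t / np.linalg.norm(t, axis=-1, keepdims=True)
    logits = v @ t.T / tau              # (n, n) similarity matrix
    labels = np.arange(len(v))          # matched pair sits on the diagonal

    def ce(lg):
        # numerically stable cross-entropy with diagonal targets
        lg = lg - lg.max(axis=-1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=-1, keepdims=True))
        return -logp[labels, labels].mean()

    # average the image-to-text and text-to-image directions
    return 0.5 * (ce(logits) + ce(logits.T))
```

Minimizing this loss increases the similarity of matched visual/textual pairs relative to all mismatched pairs in the batch, which is one standard way to make a vision encoder's feature space "speak the same language" as a text encoder's.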
📝 Abstract
Pre-trained Vision Foundation Models (VFMs) provide strong visual representations for a wide range of applications. In this paper, we continually pre-train prevailing VFMs in a multimodal manner such that they can effortlessly process visual inputs of varying sizes and produce visual representations that are more aligned with language representations, regardless of their original pre-training process. To this end, we introduce CoMP, a carefully designed multimodal pre-training pipeline. CoMP uses a Continual Rotary Position Embedding to accommodate visual inputs with different resolutions, and an Alignment Loss between visual and textual features for better cross-modal alignment. After continual pre-training, leading VFMs like DINOv2, SigLIP and AIMv2 achieve remarkable improvements not only in multimodal understanding tasks but also in generic classification and segmentation tasks. Remarkably, CoMP-AIMv2 achieves scores of 64.9 on ChartQA with a 0.5B LLM, while maintaining an 87.3% accuracy on ImageNet-1K and a 51.8 mIoU on ADE20K under frozen chunk evaluation.
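The variable-resolution support described above rests on rotary position embeddings, which compute positional phases analytically from patch coordinates rather than looking them up in a fixed-size learned table, so any grid size works without retraining. The sketch below shows a plain 2D rotary embedding in NumPy; the paper's actual Continual Rotary Position Embedding design, and the function name and frequency base used here, are assumptions for illustration only.

```python
import numpy as np

def rope_2d(x, h, w, base=10000.0):
    """Apply a 2D rotary position embedding to flattened patch features.

    Illustrative sketch only (not the paper's C-RoPE). x: (h*w, d)
    patch features in row-major grid order, with d divisible by 4.
    Half the channel pairs rotate with the row index, half with the
    column index, so any (h, w) grid is handled by the same formula.
    """
    d = x.shape[-1]
    quarter = d // 4
    inv_freq = 1.0 / (base ** (np.arange(quarter) / quarter))
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    pos = np.stack([ys.ravel(), xs.ravel()], axis=-1).astype(np.float64)  # (h*w, 2)
    # per-position rotation angles: rows drive the first half, columns the second
    ang = np.concatenate([pos[:, :1] * inv_freq, pos[:, 1:] * inv_freq], axis=-1)
    cos = np.repeat(np.cos(ang), 2, axis=-1)   # duplicate each angle per channel pair
    sin = np.repeat(np.sin(ang), 2, axis=-1)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    rot = np.stack([-x2, x1], axis=-1).reshape(x.shape)  # 90-degree pair rotation
    return x * cos + rot * sin
```

Because the rotation is a pure function of (row, column), feeding the model a larger image just means evaluating the same formula on a bigger grid; feature norms are preserved, and the patch at position (0, 0) is left unrotated.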