🤖 AI Summary
To address the challenges of variable-resolution input adaptation and insufficient vision-language representation alignment in Vision Foundation Models (VFMs), this paper proposes CoMP, a continual multimodal pre-training pipeline, with CoMP-AIMv2 as its strongest instantiation. CoMP introduces a Continual Rotary Position Embedding to enable seamless multi-scale image input support; designs a cross-modal alignment loss that pulls visual representations toward the language representation space during continual pre-training of the backbone; and establishes a unified multimodal pre-training paradigm that requires no task-specific fine-tuning. Evaluated on ChartQA (64.9 with a 0.5B LLM), ImageNet-1K (87.3% top-1 accuracy), and ADE20K (51.8 mIoU), CoMP-AIMv2 achieves substantial improvements in both multimodal understanding and unimodal perception tasks. These results validate its capability to learn general-purpose visual representations and demonstrate the effectiveness of its efficient, alignment-driven design.
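The cross-modal alignment idea can be illustrated with a generic CLIP-style symmetric contrastive loss over cosine similarities. This is a minimal sketch under assumed shapes and a hypothetical temperature; the paper's actual Alignment Loss formulation is not reproduced here.

```python
import numpy as np

def alignment_loss(v, t, tau=0.07):
    """Symmetric InfoNCE-style loss aligning visual and textual features.

    Generic sketch of cross-modal alignment (CLIP-style); the paper's
    exact loss may differ. v, t: (n, d) batches where row i of v and
    row i of t form a matched image-text pair. tau is an assumed
    temperature hyperparameter.
    """
    # L2-normalize so the dot product is cosine similarity
    v = v / np.linalg.norm(v, axis=-1, keepdims=True)
    t = t / np.linalg.norm(t, axis=-1, keepdims=True)
    logits = v @ t.T / tau              # (n, n) similarity matrix
    labels = np.arange(len(v))          # matched pair sits on the diagonal

    def ce(lg):
        # numerically stable cross-entropy with diagonal targets
        lg = lg - lg.max(axis=-1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=-1, keepdims=True))
        return -logp[labels, labels].mean()

    # average the image-to-text and text-to-image directions
    return 0.5 * (ce(logits) + ce(logits.T))
```

Minimizing this loss increases the similarity of matched visual/textual pairs relative to all mismatched pairs in the batch, which is one standard way to make a vision encoder's feature space "speak the same language" as a text encoder's.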
📝 Abstract
Pre-trained Vision Foundation Models (VFMs) provide strong visual representations for a wide range of applications. In this paper, we continually pre-train prevailing VFMs in a multimodal manner such that they can effortlessly process visual inputs of varying sizes and produce visual representations that are more aligned with language representations, regardless of their original pre-training process. To this end, we introduce CoMP, a carefully designed multimodal pre-training pipeline. CoMP uses a Continual Rotary Position Embedding to accommodate visual inputs with different resolutions, and an Alignment Loss between visual and textual features for better cross-modal alignment. After continual pre-training, leading VFMs like DINOv2, SigLIP and AIMv2 achieve remarkable improvements not only in multimodal understanding tasks but also in generic classification and segmentation tasks. Remarkably, CoMP-AIMv2 achieves scores of 64.9 on ChartQA with a 0.5B LLM, while maintaining an 87.3% accuracy on ImageNet-1K and a 51.8 mIoU on ADE20K under frozen chunk evaluation.
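The variable-resolution support described above rests on rotary position embeddings, which compute positional phases analytically from patch coordinates rather than looking them up in a fixed-size learned table, so any grid size works without retraining. The sketch below shows a plain 2D rotary embedding in NumPy; the paper's actual Continual Rotary Position Embedding design, and the function name and frequency base used here, are assumptions for illustration only.

```python
import numpy as np

def rope_2d(x, h, w, base=10000.0):
    """Apply a 2D rotary position embedding to flattened patch features.

    Illustrative sketch only (not the paper's C-RoPE). x: (h*w, d)
    patch features in row-major grid order, with d divisible by 4.
    Half the channel pairs rotate with the row index, half with the
    column index, so any (h, w) grid is handled by the same formula.
    """
    d = x.shape[-1]
    quarter = d // 4
    inv_freq = 1.0 / (base ** (np.arange(quarter) / quarter))
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    pos = np.stack([ys.ravel(), xs.ravel()], axis=-1).astype(np.float64)  # (h*w, 2)
    # per-position rotation angles: rows drive the first half, columns the second
    ang = np.concatenate([pos[:, :1] * inv_freq, pos[:, 1:] * inv_freq], axis=-1)
    cos = np.repeat(np.cos(ang), 2, axis=-1)   # duplicate each angle per channel pair
    sin = np.repeat(np.sin(ang), 2, axis=-1)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    rot = np.stack([-x2, x1], axis=-1).reshape(x.shape)  # 90-degree pair rotation
    return x * cos + rot * sin
```

Because the rotation is a pure function of (row, column), feeding the model a larger image just means evaluating the same formula on a bigger grid; feature norms are preserved, and the patch at position (0, 0) is left unrotated.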