CHIPS: Efficient CLIP Adaptation via Curvature-aware Hybrid Influence-based Data Selection

📅 2025-11-23
🤖 AI Summary
This paper targets the reliance on large-scale data when adapting CLIP to specialized domains. It proposes CHIPS, a data-selection method for continual pre-training (CPT) that assigns each image-text pair a utility score built from three components: (i) a curvature-aware, Newton-style alignment computed in CLIP's end-point subspace, for faithfulness; (ii) an InfoNCE-aware curvature estimator with Johnson-Lindenstrauss (JL) random projection, for scalability; and (iii) a selection-aware relevance weight combined with learnability, to balance target adaptation against general-domain retention. The design is supported by a lower-bound guarantee on the proxy's correlation with full-parameter alignment. Empirically, on 17 medical benchmarks CHIPS matches full-dataset CPT using 30% of the data and outperforms half-dataset CPT using only 10%. On 31 general-domain benchmarks it shows the smallest performance drop, outperforming baselines, under 10-30% data-retention budgets.

📝 Abstract
Adapting CLIP to vertical domains is typically approached by novel fine-tuning strategies or by continual pre-training (CPT) on large domain-specific datasets. Yet, data itself remains an underexplored factor in this process. We revisit this task from a data-centric perspective: Can effective data selection substitute for large-scale datasets in CPT? We introduce CHIPS (Curvature-aware Hybrid Influence in Projection Subspace), which assigns each image-text pair a utility score that integrates three complementary factors aligned with three goals: faithfulness via a curvature-aware, Newton-style alignment computed in CLIP's end-point subspace; scalability via an InfoNCE-aware curvature estimator with Johnson-Lindenstrauss (JL) sketching; and retention via a selection-aware relevance weight combined with learnability to balance target adaptation against general-domain preservation. We justify this design theoretically by proving a lower-bound guarantee on the proxy's correlation with full-parameter alignment and by characterizing the bias-variance trade-offs introduced by curvature mixing and JL sketching. We evaluate CHIPS empirically across various settings: 1) CHIPS attains state-of-the-art performance among selection baselines on 17 medical benchmarks, matches full-dataset CPT with 30% of the data, and outperforms half-dataset CPT using only 10%; 2) on 31 general-domain benchmarks, CHIPS yields the smallest performance drop under 10-30% data-retention budgets. Code, data, and checkpoints will be released.
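The scalability component relies on Johnson-Lindenstrauss sketching, which can be illustrated in isolation. The snippet below is a minimal sketch, not the authors' implementation: it assumes a dense Gaussian projection, and the names `jl_sketch`, `grads`, and `target` are hypothetical stand-ins (random vectors play the role of per-sample and target-domain gradients). It only shows that inner products computed in the low-dimensional sketch track the exact ones closely enough to preserve a ranking.

```python
import numpy as np


def jl_sketch(vectors, sketch_dim, seed=0):
    """Project rows of `vectors` (n, d) down to (n, sketch_dim).

    Hypothetical helper: uses a dense Gaussian JL matrix whose entries
    have variance 1/sketch_dim, so inner products are preserved in expectation.
    """
    rng = np.random.default_rng(seed)
    d = vectors.shape[1]
    projection = rng.normal(0.0, 1.0 / np.sqrt(sketch_dim), size=(d, sketch_dim))
    return vectors @ projection


rng = np.random.default_rng(1)
grads = rng.normal(size=(8, 4096))   # stand-in for per-sample gradient vectors
target = grads[:4].mean(axis=0)      # stand-in target gradient, aligned with rows 0-3

# Exact alignment in the full 4096-dimensional space.
exact = grads @ target

# Approximate alignment after sketching everything with the same projection.
sketched = jl_sketch(np.vstack([grads, target]), sketch_dim=512)
approx = sketched[:-1] @ sketched[-1]

# Indices of the four best-aligned samples under each measure.
top_exact = sorted(np.argsort(-exact)[:4].tolist())
top_approx = sorted(np.argsort(-approx)[:4].tolist())
```

In this toy setup the sketch is eight times smaller than the original vectors, yet both rankings pick the same four samples. How small the sketch can be in practice depends on the number of vectors and the error tolerance.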
Problem

Research questions and friction points this paper is trying to address.

Selecting optimal data subsets for efficient CLIP adaptation to vertical domains
Developing a curvature-aware hybrid utility score so that a selected subset can substitute for full-dataset continual pre-training
Balancing domain adaptation with general knowledge preservation through data selection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Curvature-aware hybrid influence scoring for data selection
Johnson-Lindenstrauss sketching enables scalable curvature estimation
Selection-aware relevance balancing target adaptation and preservation
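To make the three ideas above concrete, here is a minimal sketch of how a curvature-preconditioned alignment score, a per-sample weight, and a data budget could fit together. Everything in it is an assumption for illustration rather than the paper's formulation: the function names, the curvature mix controlled by `alpha`, the `damping` term, the Gauss-Newton-style second-moment matrix, and the random `relevance` and `learnability` weights are all hypothetical.

```python
import numpy as np


def newton_alignment(train_grads, target_grad, alpha=0.5, damping=1e-3):
    """Score each sample by g_i^T H^{-1} g_target (hypothetical form).

    H mixes the identity with a second-moment matrix of the (sketched)
    gradients; alpha=0 and damping=0 reduce this to a plain dot product.
    """
    n, k = train_grads.shape
    second_moment = train_grads.T @ train_grads / n
    curvature = (1.0 - alpha) * np.eye(k) + alpha * second_moment + damping * np.eye(k)
    # Solve H x = g_target instead of forming an explicit inverse.
    preconditioned = np.linalg.solve(curvature, target_grad)
    return train_grads @ preconditioned


def select_subset(scores, budget):
    """Return indices of the top `budget` fraction of samples by score."""
    keep = max(1, int(round(budget * len(scores))))
    return np.argsort(-scores)[:keep]


rng = np.random.default_rng(0)
train_grads = rng.normal(size=(100, 16))         # stand-in sketched per-sample gradients
target_grad = rng.normal(size=16)                # stand-in target-domain gradient
relevance = rng.uniform(0.0, 1.0, size=100)      # hypothetical relevance weights
learnability = rng.uniform(0.0, 1.0, size=100)   # hypothetical learnability weights

# Hybrid utility: per-sample weight times curvature-aware alignment.
scores = relevance * learnability * newton_alignment(train_grads, target_grad)
selected = select_subset(scores, budget=0.3)     # keep 30% of the pool
```

With a 30% budget on 100 samples, `selected` holds the 30 highest-scoring indices. Swapping in real gradients and weights would change only the inputs, not the selection logic.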
Xinlin Zhuang
MBZUAI

Yichen Li
Huazhong University of Science and Technology

Xiwei Liu
MBZUAI

Haolin Yang
University of Chicago
large language models, natural language processing

Yifan Lu
MBZUAI

Ziyun Zou
MBZUAI

Yulong Li
MBZUAI

Huifa Li
East China Normal University
Deep Learning, Graph Neural Network, LLM, AI4Science

Dongliang Chen
East China Normal University

Qinglei Wang
MBZUAI

Weiyang Liu
CUHK | Max Planck Institute for Intelligent Systems
Machine Learning, Artificial Intelligence, Computer Vision

Ying Qian
East China Normal University

Jiangming Shi
Xiamen University

Imran Razzak
MBZUAI, Abu Dhabi
Human-Centered AI, Medical Image Analysis, Medical Artificial Intelligence, Computational Biology