DALIP: Distribution Alignment-based Language-Image Pre-Training for Domain-Specific Data

📅 2025-04-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing CLIP adaptation methods for fine-grained domains—such as biology—suffer from limited domain performance and degraded generalization due to naïve full-model fine-tuning that ignores domain-specific characteristics (e.g., fine-grained discriminability). To address this, we propose DALIP, a distribution-alignment-based vision-language pretraining framework. Unlike conventional [CLS]-token-level contrastive learning, DALIP introduces a novel paradigm aligning image-text feature distributions via first- and second-order statistics. We design a Multi-Head Brownian Distance Covariance (MBDC) module to efficiently capture token-level second-order correlations. Furthermore, we construct PlantMix-13M—the first large-scale, 13-million-sample hybrid plant dataset. Experiments demonstrate that DALIP significantly outperforms state-of-the-art CLIP variants in biological domains and exhibits strong cross-domain generalization to remote sensing and medical imaging. When trained on PlantMix-13M, DALIP further boosts plant recognition accuracy while preserving robust general-purpose representation capability.

📝 Abstract
Recently, Contrastive Language-Image Pre-training (CLIP) has shown promising performance on domain-specific data (e.g., biology) and has attracted increasing research attention. Existing works generally focus on collecting extensive domain-specific data and directly tuning the original CLIP models. Intuitively, such a paradigm fails to fully account for the characteristics of domain-specific data (e.g., the fine-grained nature of biological data), which limits model capability in the target domain while largely sacrificing CLIP's original ability in the general domain. In this paper, we propose a Distribution Alignment-based Language-Image Pre-Training (DALIP) method for biological data. Specifically, DALIP optimizes CLIP models by matching the similarity between feature distributions of image-text pairs instead of only the original [cls] tokens, which captures rich yet effective information inherent in image-text pairs as powerful representations and thus better copes with the fine-grained nature of biological data. In particular, DALIP efficiently approximates each feature distribution via its first- and second-order statistics, and presents a Multi-head Brownian Distance Covariance (MBDC) module to acquire second-order statistics of token features efficiently. Furthermore, we collect a new dataset for the plant domain (a specific subdomain of biology) comprising 10M plant samples mixed with 3M general-domain samples (namely PlantMix-13M), following data mixing laws. Extensive experiments show that DALIP clearly outperforms existing CLIP counterparts in the biological domain while generalizing well to remote sensing and medical imaging domains. Moreover, our PlantMix-13M dataset further boosts DALIP's performance in the plant domain while preserving model ability in the general domain.
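The MBDC module builds on Brownian distance covariance (Székely & Rizzo), a statistic that captures dependence between two sets of paired features beyond linear correlation. The paper's multi-head variant is not specified here, so below is a minimal NumPy sketch of the plain sample distance covariance between two token-feature matrices; function and variable names are illustrative, not from the paper:

```python
import numpy as np

def brownian_distance_covariance(X, Y):
    """Squared sample distance covariance between paired features.

    X: (n, d1) and Y: (n, d2) arrays of n paired samples.
    Returns a non-negative scalar; it is zero (in the population
    limit) iff X and Y are statistically independent.
    """
    def centered_dist(Z):
        # Pairwise Euclidean distance matrix of the samples.
        D = np.sqrt(((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1))
        # Double-center: subtract row and column means, add grand mean.
        return D - D.mean(0, keepdims=True) - D.mean(1, keepdims=True) + D.mean()

    A, B = centered_dist(X), centered_dist(Y)
    # Average entrywise product of the centered distance matrices.
    return float((A * B).mean())
```

The statistic is symmetric in its arguments and vanishes for a constant input; the paper's contribution is making a multi-head version of such second-order statistics efficient for token features, which this single-head sketch does not attempt.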
Problem

Research questions and friction points this paper is trying to address.

Improves CLIP for fine-grained biological data using distribution alignment
Proposes Multi-head Brownian Distance Covariance for efficient feature statistics
Introduces PlantMix-13M dataset to enhance domain-specific performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses distribution alignment for image-text pairs
Introduces Multi-head Brownian Distance Covariance module
Combines domain-specific and general-domain data
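The bullets above can be made concrete with a toy sketch of distribution-level matching: each modality's token features are summarized by their first-order (mean) and second-order (covariance) statistics, and the two summaries are compared with cosine similarity. This is an illustration under assumed simplifications, not the paper's actual training objective or architecture:

```python
import numpy as np

def distribution_representation(tokens):
    """Summarize token features (n, d) by first- and second-order
    statistics: the mean vector and a flattened covariance matrix."""
    mu = tokens.mean(0)                         # first-order: (d,)
    centered = tokens - mu
    cov = centered.T @ centered / len(tokens)   # second-order: (d, d)
    return np.concatenate([mu, cov.ravel()])

def alignment_score(img_tokens, txt_tokens):
    """Cosine similarity between two distribution summaries, standing
    in for the contrastive matching of image-text distributions.
    Assumes both encoders share the same feature dimension d."""
    a = distribution_representation(img_tokens)
    b = distribution_representation(txt_tokens)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
```

Compared with matching only the [cls] token, this kind of summary retains information about how token features spread around their mean, which is the intuition behind using distributions for fine-grained data.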