Decoupled Global-Local Alignment for Improving Compositional Understanding

📅 2025-04-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
CLIP’s global contrastive learning struggles to model compositional concepts (e.g., attributes and relations), and existing methods that add global hard negatives improve compositional reasoning at the cost of severely degrading general-purpose representation ability. To resolve this trade-off, the authors propose a Decoupled Global-Local Alignment (DeGLA) framework: (i) global alignment with a self-distillation mechanism (a frozen EMA teacher) to mitigate forgetting of pretrained knowledge; (ii) local, fine-grained alignment via Image-Grounded Contrast (IGC) and Text-Grounded Contrast (TGC) losses; and (iii) about 2M high-quality hard-negative captions across five types, generated automatically via LLM in-context learning. DeGLA achieves a +3.5% average improvement on compositional benchmarks (VALSE, SugarCrepe, ARO) and a +13.0% average improvement on zero-shot classification across 11 standard datasets, enhancing compositional understanding while preserving CLIP’s inherent generalizability.
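The abstract does not detail how the negative captions are produced, but the generation step lends itself to a short illustration. The sketch below shows one plausible in-context-learning loop; the `llm_generate` callable, the prompt wording, and the five aspect names are placeholders, not DeGLA's published pipeline.

```python
from typing import Callable, List

# Illustrative sketch of LLM-driven hard-negative caption generation.
# The paper's actual prompts and its five negative types are not given in
# the abstract; the aspect list and `llm_generate` callable are assumptions.

FEW_SHOT_PROMPT = """You rewrite image captions into hard negatives.
Change exactly one aspect ({aspect}) so the caption no longer matches the image.

Caption: a red car parked beside a blue bicycle
Negative: a blue car parked beside a red bicycle

Caption: {caption}
Negative:"""

def make_negatives(caption: str,
                   llm_generate: Callable[[str], str],
                   aspects: List[str]) -> List[str]:
    """Produce one hard-negative caption per perturbation aspect."""
    negatives = []
    for aspect in aspects:
        prompt = FEW_SHOT_PROMPT.format(aspect=aspect, caption=caption)
        negatives.append(llm_generate(prompt).strip())
    return negatives

# Hypothetical aspect names; the paper defines five types but the abstract
# does not enumerate them.
ASPECTS = ["attribute", "relation", "object", "count", "spatial layout"]
```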

📝 Abstract
Contrastive Language-Image Pre-training (CLIP) has achieved success on multiple downstream tasks by aligning image and text modalities. However, the nature of global contrastive learning limits CLIP's ability to comprehend compositional concepts such as relations and attributes. Although recent studies employ global hard negative samples to improve compositional understanding, these methods significantly compromise the model's inherent general capabilities by forcibly distancing textual negative samples from images in the embedding space. To overcome this limitation, we introduce a Decoupled Global-Local Alignment (DeGLA) framework that improves compositional understanding while substantially mitigating losses in general capabilities. To optimize the retention of the model's inherent capabilities, we incorporate a self-distillation mechanism within the global alignment process, aligning the learnable image-text encoder with a frozen teacher model derived from an exponential moving average. This constraint effectively mitigates the catastrophic forgetting of pretrained knowledge during fine-tuning. To improve compositional understanding, we first leverage the in-context learning capability of Large Language Models (LLMs) to construct about 2M high-quality negative captions across five types. Subsequently, we propose the Image-Grounded Contrast (IGC) loss and Text-Grounded Contrast (TGC) loss to enhance vision-language compositionality. Extensive experimental results demonstrate the effectiveness of the DeGLA framework. Compared to previous state-of-the-art methods, DeGLA achieves an average enhancement of 3.5% across the VALSE, SugarCrepe, and ARO benchmarks. Concurrently, it obtains an average performance improvement of 13.0% on zero-shot classification tasks across eleven datasets. Our code will be released at https://github.com/xiaoxing2001/DeGLA.
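The global-alignment branch described above pairs the learnable encoder with a frozen teacher maintained as an exponential moving average of the student. As a rough orientation, here is a minimal PyTorch-style sketch of that mechanism; the momentum value and the cosine distillation loss are assumptions, since the abstract does not specify the exact objective.

```python
import copy
import torch
import torch.nn.functional as F

# Minimal sketch of the self-distillation constraint described in the
# abstract: a frozen EMA teacher regularizes the fine-tuned student so
# pretrained knowledge is not catastrophically forgotten. The momentum
# value and the cosine loss are assumptions, not DeGLA's published choices.

def build_teacher(student: torch.nn.Module) -> torch.nn.Module:
    teacher = copy.deepcopy(student)
    for p in teacher.parameters():
        p.requires_grad = False  # the teacher is never trained directly
    return teacher

@torch.no_grad()
def ema_update(teacher, student, momentum: float = 0.999):
    # teacher <- momentum * teacher + (1 - momentum) * student
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(momentum).add_(s, alpha=1.0 - momentum)

def distill_loss(student_feats: torch.Tensor,
                 teacher_feats: torch.Tensor) -> torch.Tensor:
    # Pull student embeddings toward the (detached) teacher's embeddings.
    return 1.0 - F.cosine_similarity(
        student_feats, teacher_feats.detach(), dim=-1).mean()
```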
Problem

Research questions and friction points this paper is trying to address.

Improves CLIP's compositional understanding without losing general capabilities
Uses self-distillation to retain pretrained knowledge during fine-tuning
Enhances vision-language alignment via novel contrastive losses
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decoupled Global-Local Alignment (DeGLA) framework
Self-distillation mechanism in global alignment
Image-Grounded and Text-Grounded Contrast losses (see the sketch after this list)
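For orientation, the sketch below implements an IGC-style loss under common conventions: each image is scored against its positive caption plus a set of LLM-generated hard negatives with an InfoNCE objective. The exact DeGLA formulation is not given here, so treat the shapes, temperature, and loss form as assumptions; a TGC term would anchor on captions symmetrically.

```python
import torch
import torch.nn.functional as F

# Sketch of an image-grounded contrast (IGC) style loss: each image is
# contrasted against its positive caption and K hard-negative captions.
# This InfoNCE-with-extra-negatives form is an assumption, not the paper's
# published formulation.

def igc_loss(img: torch.Tensor,        # (B, D) image embeddings
             txt_pos: torch.Tensor,    # (B, D) positive caption embeddings
             txt_neg: torch.Tensor,    # (B, K, D) hard-negative embeddings
             temperature: float = 0.07) -> torch.Tensor:
    img = F.normalize(img, dim=-1)
    txt_pos = F.normalize(txt_pos, dim=-1)
    txt_neg = F.normalize(txt_neg, dim=-1)

    pos = (img * txt_pos).sum(-1, keepdim=True)          # (B, 1)
    neg = torch.einsum("bd,bkd->bk", img, txt_neg)       # (B, K)
    logits = torch.cat([pos, neg], dim=1) / temperature  # (B, 1+K)

    # The positive caption sits at index 0 for every image.
    target = torch.zeros(img.size(0), dtype=torch.long, device=img.device)
    return F.cross_entropy(logits, target)
```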
👥 Authors
Xiaoxing Hu
M.S., Beijing Institute of Technology
Computer Vision, Multi-Modal Learning
Kaicheng Yang
DeepGlint
Multimodal, CV, NLP
Jun Wang
DeepGlint, Beijing, China
Haoran Xu
Zhejiang University, Zhejiang Province, China
Ziyong Feng
DeepGlint, Beijing, China
Yupei Wang
Beijing Institute of Technology, Beijing, China