FLAME: Frozen Large Language Models Enable Data-Efficient Language-Image Pre-training

📅 2024-11-18
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
To address the limitations of CLIP-style models, namely insufficient text-encoder capacity, weak long-text modeling, and poor multilingual generalization, this paper proposes a data-efficient vision-language alignment framework that leverages a frozen large language model (LLM) as the text encoder. The key contributions are: (1) adopting a frozen LLM as a fixed, high-capacity text encoder that naturally handles long and multilingual captions; (2) a multifaceted prompt distillation technique that extracts diverse semantic representations from long captions, better matching the multifaceted nature of images; and (3) a facet-decoupled attention mechanism, complemented by offline embedding caching, for efficient computation. Trained with contrastive learning, the method achieves state-of-the-art performance: +4.9% ImageNet top-1 accuracy over the prior best when trained on CC3M; +44.4% average image-to-text R@1 across 36 languages over WIT-400M-trained CLIP when trained on YFCC15M; and +34.6% text-to-image R@1 for long-context retrieval on Urban-1k.
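The efficiency hinge is that the frozen LLM never receives gradients, so caption embeddings can be computed once and cached before contrastive training begins. Below is a minimal sketch of that offline step; the LLM checkpoint, masked mean pooling, and cache format are illustrative assumptions, not the authors' released implementation.

```python
import torch
from transformers import AutoModel, AutoTokenizer

llm_name = "mistralai/Mistral-7B-v0.1"  # placeholder checkpoint; the paper's choice may differ
tokenizer = AutoTokenizer.from_pretrained(llm_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # decoder-only LLMs often lack a pad token
llm = AutoModel.from_pretrained(llm_name, torch_dtype=torch.float16).eval().cuda()

@torch.no_grad()
def embed_captions(captions: list[str]) -> torch.Tensor:
    """Encode captions with the frozen LLM and masked-mean-pool the last hidden states."""
    batch = tokenizer(captions, padding=True, truncation=True,
                      return_tensors="pt").to(llm.device)
    hidden = llm(**batch).last_hidden_state          # (B, T, D)
    mask = batch["attention_mask"].unsqueeze(-1)     # (B, T, 1), 1 for real tokens
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# One-off caching pass: training later reads this file instead of re-running the LLM.
emb = embed_captions(["a dog running on a beach", "city skyline at dusk"])
torch.save(emb, "caption_cache.pt")  # cache filename is illustrative
```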

📝 Abstract
Language-image pre-training faces significant challenges due to limited data in specific formats and the constrained capacities of text encoders. While prevailing methods attempt to address these issues through data augmentation and architecture modifications, they continue to struggle with processing long-form text inputs, and the inherent limitations of traditional CLIP text encoders lead to suboptimal downstream generalization. In this paper, we propose FLAME (Frozen Large lAnguage Models Enable data-efficient language-image pre-training) that leverages frozen large language models as text encoders, naturally processing long text inputs and demonstrating impressive multilingual generalization. FLAME comprises two key components: 1) a multifaceted prompt distillation technique for extracting diverse semantic representations from long captions, which better aligns with the multifaceted nature of images, and 2) a facet-decoupled attention mechanism, complemented by an offline embedding strategy, to ensure efficient computation. Extensive empirical evaluations demonstrate FLAME's superior performance. When trained on CC3M, FLAME surpasses the previous state-of-the-art by 4.9% in ImageNet top-1 accuracy. On YFCC15M, FLAME surpasses the WIT-400M-trained CLIP by 44.4% in average image-to-text recall@1 across 36 languages, and by 34.6% in text-to-image recall@1 for long-context retrieval on Urban-1k. Code is available at https://github.com/MIV-XJTU/FLAME.
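With text embeddings cached offline, training reduces to a standard CLIP-style objective in which only the image tower and projection heads are updated. A minimal sketch of the symmetric InfoNCE loss, following common CLIP practice rather than the paper's exact recipe:

```python
import torch
import torch.nn.functional as F

def clip_loss(img_feat: torch.Tensor, txt_feat: torch.Tensor,
              temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over the in-batch image-text similarity matrix."""
    img = F.normalize(img_feat, dim=-1)
    txt = F.normalize(txt_feat, dim=-1)
    logits = img @ txt.t() / temperature                       # (B, B)
    targets = torch.arange(len(logits), device=logits.device)  # matched pairs on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Per step (hypothetical names): img_feat = proj_i(image_encoder(images));
# txt_feat = proj_t(cached_text_emb). Only the image side backpropagates.
```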
Problem

Research questions and friction points this paper is trying to address.

Language-image pre-training is constrained by limited data in specific formats
Traditional CLIP text encoders struggle with long-form text inputs
Downstream generalization remains suboptimal for multilingual and long-context retrieval
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses frozen large language models as text encoders
Employs multifaceted prompt distillation for semantic alignment
Introduces facet-decoupled attention for efficient computation (see the sketch after this list)
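The abstract does not spell out the attention layout, so the following is a sketch of one plausible reading of facet-decoupled attention: several facet prompts are appended to a long caption in a single LLM forward pass, and the mask lets each facet attend to the caption but not to the other facets. The mask construction and toy sizes are assumptions for illustration.

```python
import torch

def facet_decoupled_mask(cap_len: int, num_facets: int) -> torch.Tensor:
    """Boolean attention mask (True = may attend) over [caption tokens | facet tokens]."""
    total = cap_len + num_facets
    mask = torch.zeros(total, total, dtype=torch.bool)
    # Caption tokens attend causally among themselves (decoder-only LLM).
    mask[:cap_len, :cap_len] = torch.tril(
        torch.ones(cap_len, cap_len, dtype=torch.bool))
    # Each facet token sees the full caption and itself, but no other facet,
    # so one forward pass yields num_facets decoupled caption embeddings.
    for i in range(num_facets):
        row = cap_len + i
        mask[row, :cap_len] = True
        mask[row, row] = True
    return mask

mask = facet_decoupled_mask(cap_len=6, num_facets=3)  # toy sizes
```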
Anjia Cao
Xi'an Jiaotong University
Data-Efficient Learning · Multimodal Learning · MLLMs
Xing Wei
School of Software Engineering, Xi’an Jiaotong University
Zhiheng Ma
Shenzhen University of Advanced Technology, Guangdong Provincial Key Laboratory of Computility Microelectronics, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences