🤖 AI Summary
CLIP excels in image–text retrieval and zero-shot classification but struggles with fine-grained semantic distinctions because its supervision consists of coarse-grained, short captions. To address this, the authors propose FG-CLIP, a framework that enhances fine-grained cross-modal understanding through innovations in data, modeling, and training. First, they leverage large multimodal models to generate 1.6 billion long caption–image pairs that capture global-level semantic detail. Second, they construct a high-quality dataset of 12 million images with 40 million region-level bounding boxes aligned to detailed captions, explicitly modeling region–text alignment. Third, they incorporate 10 million hard fine-grained negative samples into a contrastive learning setup tailored for fine-grained discrimination. Extensive experiments demonstrate that FG-CLIP consistently outperforms the original CLIP and prior state-of-the-art methods across fine-grained understanding, open-vocabulary object detection, image–text retrieval, and general multimodal benchmarks.
📝 Abstract
Contrastive Language-Image Pre-training (CLIP) excels in multimodal tasks such as image-text retrieval and zero-shot classification but struggles with fine-grained understanding due to its focus on coarse-grained short captions. To address this, we propose Fine-Grained CLIP (FG-CLIP), which enhances fine-grained understanding through three key innovations. First, we leverage large multimodal models to generate 1.6 billion long caption-image pairs for capturing global-level semantic details. Second, a high-quality dataset is constructed with 12 million images and 40 million region-specific bounding boxes aligned with detailed captions to ensure precise, context-rich representations. Third, 10 million hard fine-grained negative samples are incorporated to improve the model's ability to distinguish subtle semantic differences. Corresponding training methods are carefully designed for each of these data sources. Extensive experiments demonstrate that FG-CLIP outperforms the original CLIP and other state-of-the-art methods across various downstream tasks, including fine-grained understanding, open-vocabulary object detection, image-text retrieval, and general multimodal benchmarks. These results highlight FG-CLIP's effectiveness in capturing fine-grained image details and improving overall model performance. The related data, code, and models are available at https://github.com/360CVGroup/FG-CLIP.
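To make the third innovation concrete, the sketch below shows how hard negatives can be folded into a CLIP-style InfoNCE contrastive loss: each image scores its matching caption against both in-batch captions and its own pool of hard negatives. This is a minimal NumPy illustration under assumed shapes and names; it is not the paper's actual implementation.

```python
import numpy as np

def info_nce_with_hard_negatives(img_emb, txt_emb, hard_neg_emb, temperature=0.07):
    """Illustrative InfoNCE loss with per-image hard negative captions.

    img_emb:      (B, D)    image embeddings
    txt_emb:      (B, D)    caption embeddings; txt_emb[i] matches img_emb[i]
    hard_neg_emb: (B, K, D) K hard-negative caption embeddings per image
    (Shapes and the function name are assumptions for this sketch.)
    """
    l2 = lambda x: x / np.linalg.norm(x, axis=-1, keepdims=True)
    img, txt, neg = l2(img_emb), l2(txt_emb), l2(hard_neg_emb)

    # In-batch image-to-text similarities: (B, B); diagonal entries are positives.
    logits_batch = img @ txt.T / temperature
    # Similarities to each image's own hard negatives: (B, K).
    logits_hard = np.einsum('bd,bkd->bk', img, neg) / temperature
    # Candidate pool per image = batch captions + hard negatives: (B, B + K).
    logits = np.concatenate([logits_batch, logits_hard], axis=1)

    # Cross-entropy with the matching caption (index i) as the target class.
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(img.shape[0])
    return -log_probs[idx, idx].mean()
```

Appending hard negatives to the candidate pool sharpens the softmax: the model is penalized unless the true caption outscores near-miss captions that differ only in subtle attributes, which is exactly the fine-grained discrimination the abstract targets.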