FG-CLIP 2: A Bilingual Fine-grained Vision-Language Alignment Model

📅 2025-10-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing fine-grained vision-language models struggle to accurately model object attributes, spatial relations, and long-text semantics in bilingual (Chinese–English) settings, and Chinese-adapted evaluation benchmarks are lacking. To address this, we introduce FG-CLIP 2, a bilingual fine-grained vision-language alignment model trained on mixed English–Chinese data that jointly optimizes region–text matching, long-caption modeling, and multiple discriminative objectives. We propose a Textual Intra-modal Contrastive (TIC) loss to enhance discriminability among semantically similar textual descriptions, and we introduce CM-FineEval, the first fine-grained multimodal understanding benchmark tailored for Chinese, supporting long-text retrieval and bounding-box classification. Our method achieves state-of-the-art performance across 29 datasets and 8 tasks in both Chinese and English. We publicly release our models, code, and the CM-FineEval benchmark to advance research on bilingual fine-grained cross-modal alignment.

📝 Abstract
Fine-grained vision-language understanding requires precise alignment between visual content and linguistic descriptions, a capability that remains limited in current models, particularly in non-English settings. While models like CLIP perform well on global alignment, they often struggle to capture fine-grained details in object attributes, spatial relations, and linguistic expressions, with limited support for bilingual comprehension. To address these challenges, we introduce FG-CLIP 2, a bilingual vision-language model designed to advance fine-grained alignment for both English and Chinese. Our approach leverages rich fine-grained supervision, including region-text matching and long-caption modeling, alongside multiple discriminative objectives. We further introduce the Textual Intra-modal Contrastive (TIC) loss to better distinguish semantically similar captions. Trained on a carefully curated mixture of large-scale English and Chinese data, FG-CLIP 2 achieves powerful bilingual performance. To enable rigorous evaluation, we present a new benchmark for Chinese multimodal understanding, featuring long-caption retrieval and bounding box classification. Extensive experiments on 29 datasets across 8 tasks show that FG-CLIP 2 outperforms existing methods, achieving state-of-the-art results in both languages. We release the model, code, and benchmark to facilitate future research on bilingual fine-grained alignment.
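As a rough illustration of the TIC objective described above, here is a minimal PyTorch-style sketch of a textual intra-modal contrastive loss. It assumes an InfoNCE form over (caption, paraphrase, hard-negative caption) triplets; the actual FG-CLIP 2 formulation, negative-mining strategy, and hyperparameters are not given in this summary, and every function and argument name below is illustrative.

```python
# Illustrative sketch only: NOT the released FG-CLIP 2 implementation.
# Assumes each caption comes with a paraphrase (positive) and a semantically
# similar but non-matching caption (hard negative).
import torch
import torch.nn.functional as F

def tic_loss(anchor_emb: torch.Tensor,
             positive_emb: torch.Tensor,
             hard_negative_emb: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Pull each caption embedding toward its paraphrase and push it away
    from a semantically close but incorrect caption."""
    anchor = F.normalize(anchor_emb, dim=-1)           # (B, D)
    positive = F.normalize(positive_emb, dim=-1)       # (B, D)
    negative = F.normalize(hard_negative_emb, dim=-1)  # (B, D)

    pos_sim = (anchor * positive).sum(dim=-1, keepdim=True)  # (B, 1)
    neg_sim = (anchor * negative).sum(dim=-1, keepdim=True)  # (B, 1)

    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature  # (B, 2)
    targets = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, targets)  # positive sits at index 0
```

The intended effect, per the abstract, is that captions differing only in fine-grained details (attributes, spatial relations) stay separable in the text embedding space even when they are lexically close.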
Problem

Research questions and friction points this paper is trying to address.

Fine-grained vision-language alignment remains weak in bilingual (Chinese–English) settings
Current models fail to precisely capture object attributes and spatial relations
Existing training objectives and benchmarks offer limited support for bilingual comprehension
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bilingual vision-language model for English and Chinese
Uses fine-grained supervision with region-text matching and long-caption modeling (see the sketch after this list)
Introduces a Textual Intra-modal Contrastive (TIC) loss to distinguish semantically similar captions
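The region-text matching supervision mentioned in the list above can be pictured as a symmetric contrastive loss between pooled bounding-box features and phrase embeddings. The sketch below is only a schematic under that assumption, not the authors' released code; the region pooling step, loss weighting, and all names are hypothetical.

```python
# Schematic sketch of region-text matching, assuming matched
# (bounding-box feature, region description) pairs are available.
import torch
import torch.nn.functional as F

def region_text_matching_loss(region_feats: torch.Tensor,
                              phrase_feats: torch.Tensor,
                              temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over N matched (region, phrase) pairs.

    region_feats: (N, D) features pooled from annotated bounding boxes.
    phrase_feats: (N, D) text embeddings of the corresponding descriptions.
    """
    regions = F.normalize(region_feats, dim=-1)
    phrases = F.normalize(phrase_feats, dim=-1)

    logits = regions @ phrases.t() / temperature      # (N, N) similarity matrix
    targets = torch.arange(regions.size(0), device=regions.device)

    loss_r2t = F.cross_entropy(logits, targets)       # region -> phrase
    loss_t2r = F.cross_entropy(logits.t(), targets)   # phrase -> region
    return 0.5 * (loss_r2t + loss_t2r)
```

Treating both retrieval directions symmetrically mirrors the global image-text contrastive setup of CLIP, applied here at the region and phrase level.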
👥 Authors
Chunyu Xie (Beihang University; 360 AI Research): Multimodal learning, Computer vision, Machine learning
Bin Wang (360 AI Research)
Fanjing Kong (360 AI Research)
Jincheng Li (360 AI Research)
Dawei Liang (360 AI Research)
Ji Ao (360 AI Research)
Dawei Leng (Dr.): Multimodal Understanding, Multimodal Generation, Vision and Language
Yuhui Yin (360 AI Research)