🤖 AI Summary
Existing fine-grained vision-language models struggle to accurately model object attributes, spatial relations, and long-text semantics in bilingual (Chinese–English) settings, and Chinese-adapted evaluation benchmarks are scarce. To address this, we propose a Textual Intra-modal Contrastive (TIC) loss that enhances discriminability among semantically similar textual descriptions. We introduce CM-FineEval, the first fine-grained multimodal understanding benchmark tailored for Chinese, supporting long-text retrieval and bounding-box classification. Furthermore, leveraging mixed English–Chinese training data, we jointly optimize region–text matching, fine-grained supervised learning, and multiple discriminative objectives. Our method achieves state-of-the-art performance across 29 datasets and 8 tasks in both Chinese and English. We publicly release our models, code, and the CM-FineEval benchmark to advance research on bilingual fine-grained cross-modal alignment.
📝 Abstract
Fine-grained vision-language understanding requires precise alignment between visual content and linguistic descriptions, a capability that remains limited in current models, particularly in non-English settings. While models like CLIP perform well on global alignment, they often struggle to capture fine-grained details in object attributes, spatial relations, and linguistic expressions, with limited support for bilingual comprehension. To address these challenges, we introduce FG-CLIP 2, a bilingual vision-language model designed to advance fine-grained alignment for both English and Chinese. Our approach leverages rich fine-grained supervision, including region-text matching and long-caption modeling, alongside multiple discriminative objectives. We further introduce the Textual Intra-modal Contrastive (TIC) loss to better distinguish semantically similar captions. Trained on a carefully curated mixture of large-scale English and Chinese data, FG-CLIP 2 achieves powerful bilingual performance. To enable rigorous evaluation, we present a new benchmark for Chinese multimodal understanding, featuring long-caption retrieval and bounding box classification. Extensive experiments on 29 datasets across 8 tasks show that FG-CLIP 2 outperforms existing methods, achieving state-of-the-art results in both languages. We release the model, code, and benchmark to facilitate future research on bilingual fine-grained alignment.
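The abstract does not spell out how the TIC loss is formulated. The sketch below shows one common way an intra-modal InfoNCE-style objective over caption embeddings could be written in PyTorch; the function name `tic_loss`, the paraphrase-positive / hard-negative setup, and the temperature value are illustrative assumptions rather than the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def tic_loss(anchor_emb, positive_emb, hard_negative_embs, temperature=0.07):
    """Illustrative intra-modal contrastive loss over text embeddings only.

    anchor_emb:         (B, D)    embeddings of the ground-truth captions
    positive_emb:       (B, D)    embeddings of matching (e.g. paraphrased) captions
    hard_negative_embs: (B, K, D) embeddings of semantically similar but
                        incorrect captions (e.g. swapped attributes or relations)

    Hypothetical sketch; the real TIC loss in FG-CLIP 2 may differ.
    """
    # L2-normalize so dot products are cosine similarities, as in CLIP
    anchor = F.normalize(anchor_emb, dim=-1)              # (B, D)
    positive = F.normalize(positive_emb, dim=-1)          # (B, D)
    negatives = F.normalize(hard_negative_embs, dim=-1)   # (B, K, D)

    # Similarity of each anchor caption to its positive caption: (B, 1)
    pos_sim = (anchor * positive).sum(dim=-1, keepdim=True)
    # Similarity of each anchor caption to its K hard-negative captions: (B, K)
    neg_sim = torch.einsum('bd,bkd->bk', anchor, negatives)

    # InfoNCE: the positive caption must out-score its hard negatives
    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature  # (B, 1+K)
    targets = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, targets)
```

In such a setup, the anchor, positive, and hard-negative captions would all be encoded by the same text encoder, with hard negatives mined by perturbing attributes or spatial relations in the ground-truth caption so that the loss specifically sharpens distinctions among semantically similar descriptions.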