🤖 AI Summary
Existing fine-grained vision-language models struggle to accurately model object attributes, spatial relations, and long-text semantics in bilingual (Chinese–English) settings, and Chinese-adapted evaluation benchmarks are scarce. To address this, we propose a Textual Intra-modal Contrastive (TIC) loss that enhances discriminability among semantically similar textual descriptions. We introduce CM-FineEval, the first fine-grained multimodal understanding benchmark tailored for Chinese, supporting long-text retrieval and bounding-box classification. Furthermore, leveraging mixed English–Chinese training data, we jointly optimize region–text matching, fine-grained supervised learning, and multiple discriminative objectives. Our method achieves state-of-the-art performance across 29 datasets and 8 tasks in both Chinese and English. We publicly release our models, code, and the CM-FineEval benchmark to advance research on bilingual fine-grained cross-modal alignment.
📝 Abstract
Fine-grained vision-language understanding requires precise alignment between visual content and linguistic descriptions, a capability that remains limited in current models, particularly in non-English settings. While models like CLIP perform well on global alignment, they often struggle to capture fine-grained details in object attributes, spatial relations, and linguistic expressions, with limited support for bilingual comprehension. To address these challenges, we introduce FG-CLIP 2, a bilingual vision-language model designed to advance fine-grained alignment for both English and Chinese. Our approach leverages rich fine-grained supervision, including region-text matching and long-caption modeling, alongside multiple discriminative objectives. We further introduce the Textual Intra-modal Contrastive (TIC) loss to better distinguish semantically similar captions. Trained on a carefully curated mixture of large-scale English and Chinese data, FG-CLIP 2 achieves powerful bilingual performance. To enable rigorous evaluation, we present a new benchmark for Chinese multimodal understanding, featuring long-caption retrieval and bounding box classification. Extensive experiments on 29 datasets across 8 tasks show that FG-CLIP 2 outperforms existing methods, achieving state-of-the-art results in both languages. We release the model, code, and benchmark to facilitate future research on bilingual fine-grained alignment.
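The abstract does not spell out how the TIC loss is formulated. The sketch below shows one common way an intra-modal InfoNCE-style objective over caption embeddings could be written in PyTorch; the function name `tic_loss`, the paraphrase-positive / hard-negative setup, and the temperature value are illustrative assumptions rather than the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def tic_loss(anchor_emb, positive_emb, hard_negative_embs, temperature=0.07):
    """Illustrative intra-modal contrastive loss over text embeddings only.

    anchor_emb:         (B, D)    embeddings of the ground-truth captions
    positive_emb:       (B, D)    embeddings of matching (e.g. paraphrased) captions
    hard_negative_embs: (B, K, D) embeddings of semantically similar but
                        incorrect captions (e.g. swapped attributes or relations)

    Hypothetical sketch; the real TIC loss in FG-CLIP 2 may differ.
    """
    # L2-normalize so dot products are cosine similarities, as in CLIP
    anchor = F.normalize(anchor_emb, dim=-1)              # (B, D)
    positive = F.normalize(positive_emb, dim=-1)          # (B, D)
    negatives = F.normalize(hard_negative_embs, dim=-1)   # (B, K, D)

    # Similarity of each anchor caption to its positive caption: (B, 1)
    pos_sim = (anchor * positive).sum(dim=-1, keepdim=True)
    # Similarity of each anchor caption to its K hard-negative captions: (B, K)
    neg_sim = torch.einsum('bd,bkd->bk', anchor, negatives)

    # InfoNCE: the positive caption must out-score its hard negatives
    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature  # (B, 1+K)
    targets = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, targets)
```

In such a setup, the anchor, positive, and hard-negative captions would all be encoded by the same text encoder, with hard negatives mined by perturbing attributes or spatial relations in the ground-truth caption so that the loss specifically sharpens distinctions among semantically similar descriptions.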