🤖 AI Summary
Current vision-language models (e.g., CLIP) struggle to discriminate visually similar yet culturally distinct concepts, primarily due to insufficient high-quality culture-specific data, weak contextual awareness, and the absence of challenging negative samples. To address this, we propose the first culture-aware synthetic data construction paradigm, introducing CulTwin, a culturally grounded twin dataset. CulTwin generates concept-text-image triplets via open-source text-to-image diffusion models guided by culturally informed, context-enriched textual descriptions, and incorporates culturally aligned hard negatives. We further design a customized contrastive learning strategy to fine-tune CLIP. Experiments demonstrate substantial improvements over baselines across multiple cultural understanding benchmarks, with up to a 5.49% gain in fine-grained recognition accuracy, while preserving generalization performance on standard vision-language tasks.
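To make the dataset's structure concrete, the sketch below models one CulTwin training example as described above: a concept-caption-image triplet paired with a visually similar but culturally distinct twin that serves as a hard negative. The class and field names are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass

@dataclass
class CulturalTwinPair:
    """One CulTwin example: a positive triplet plus its visually similar,
    culturally distinct twin used as a hard negative (hypothetical schema)."""
    concept: str          # cultural concept name
    caption: str          # culturally informed, context-enriched description
    image_path: str       # synthetic image from a text-to-image diffusion model
    twin_concept: str     # visually similar concept from a different culture
    twin_caption: str     # context-enriched description of the twin concept
    twin_image_path: str  # synthetic image of the twin concept
```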
📄 Abstract
Pretrained vision-language models (VLMs) such as CLIP excel at multimodal understanding but struggle to capture contextually relevant fine-grained visual features, making it difficult to distinguish visually similar yet culturally distinct concepts. This limitation stems from the scarcity of high-quality culture-specific datasets, the lack of integrated contextual knowledge, and the absence of hard negatives that highlight subtle distinctions. To address these challenges, we first design a data curation pipeline that leverages open-source VLMs and text-to-image diffusion models to construct CulTwin, a synthetic cultural dataset. The dataset consists of paired concept-caption-image triplets in which the concepts visually resemble each other but belong to different cultural contexts. We then fine-tune CLIP on CulTwin to create CultureCLIP, which aligns cultural concepts with contextually enhanced captions and synthetic images through customized contrastive learning, enabling finer cultural differentiation while preserving generalization capabilities. On culturally relevant benchmarks, CultureCLIP outperforms the base CLIP, achieving up to a 5.49% improvement in fine-grained concept recognition on certain tasks while retaining CLIP's original generalization ability. These results validate the effectiveness of our data synthesis and VLM backbone training paradigm in capturing subtle cultural distinctions.
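The abstract does not spell out the customized contrastive objective, so the following is only a plausible minimal sketch: a CLIP-style InfoNCE loss in which each image's twin caption is appended to the in-batch candidates as a culturally aligned hard negative. The function name, fixed temperature, and batching scheme are assumptions, not the paper's confirmed formulation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss_with_twins(img_emb, txt_emb, twin_txt_emb, temperature=0.07):
    """CLIP-style InfoNCE over in-batch captions plus twin hard negatives.

    img_emb:      (B, D) L2-normalized image embeddings
    txt_emb:      (B, D) L2-normalized embeddings of the matching captions
    twin_txt_emb: (B, D) L2-normalized embeddings of the twin captions
                  (visually similar, culturally distinct hard negatives)
    """
    # Candidate texts for each image: B in-batch captions + B twin captions.
    candidates = torch.cat([txt_emb, twin_txt_emb], dim=0)   # (2B, D)
    logits = img_emb @ candidates.t() / temperature          # (B, 2B)
    # The matching caption for image i sits at column i of the logits.
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return F.cross_entropy(logits, targets)
```

In practice, a symmetric text-to-image term and a learnable temperature, as in the original CLIP objective, would likely also be included.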