Multimodal Cultural Safety: Evaluation Frameworks and Alignment Strategies

📅 2025-05-20
🤖 AI Summary
Large vision-language models (LVLMs) exhibit critical deficiencies in culturally safe reasoning across diverse sociocultural contexts. Method: We introduce CROSS, the first cross-cultural, multilingual, vision-grounded benchmark for cultural safety evaluation, covering 16 countries, 14 languages, and three everyday scenarios, and propose CROSS-Eval, a four-dimensional evaluation framework. We further design a culture-context-driven vision-language alignment enhancement method that integrates culture-grounded supervised fine-tuning with contrastive preference optimization. Contribution/Results: Evaluating 21 state-of-the-art LVLMs, we identify a severe cultural safety gap: the top-performing model achieves only 37.73% compliance. After enhancement via our method, GPT-4o shows a 60.14% improvement in cultural awareness and a 55.2% gain in compliance while preserving general multimodal capabilities. This work establishes the first technical foundation for cross-cultural safety assessment and alignment in multimodal AI.

📝 Abstract
Large vision-language models (LVLMs) are increasingly deployed in globally distributed applications, such as tourism assistants, yet their ability to produce culturally appropriate responses remains underexplored. Existing multimodal safety benchmarks primarily focus on physical safety and overlook violations rooted in cultural norms, which can result in symbolic harm. To address this gap, we introduce CROSS, a benchmark designed to assess the cultural safety reasoning capabilities of LVLMs. CROSS includes 1,284 multilingual, visually grounded queries spanning 16 countries, three everyday domains, and 14 languages, where cultural norm violations emerge only when images are interpreted in context. We propose CROSS-Eval, an intercultural-theory-based framework that measures four key dimensions: cultural awareness, norm education, compliance, and helpfulness. Using this framework, we evaluate 21 leading LVLMs, including mixture-of-experts and reasoning models. Results reveal significant cultural safety gaps: the best-performing model achieves only 61.79% in awareness and 37.73% in compliance. While some open-source models reach GPT-4o-level performance, they still fall notably short of proprietary models. Our results further show that increasing reasoning capacity improves cultural alignment but does not fully resolve the issue. To improve model performance, we develop two enhancement strategies: supervised fine-tuning with culturally grounded, open-ended data, and preference tuning with contrastive response pairs that highlight safe versus unsafe behaviors. These methods substantially improve GPT-4o's cultural awareness (+60.14%) and compliance (+55.2%) while preserving general multimodal capabilities, with minimal performance reduction on standard multimodal understanding benchmarks.
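The preference-tuning strategy described above pairs a culturally safe response with an unsafe one for the same visually grounded query. A common way to train on such contrastive pairs is a DPO-style objective; the sketch below illustrates that general technique only, not the paper's exact loss (the function name, inputs, and `beta` are assumptions):

```python
import math

def contrastive_preference_loss(logp_safe, logp_unsafe,
                                ref_logp_safe, ref_logp_unsafe,
                                beta=0.1):
    """DPO-style loss on one (safe, unsafe) response pair.

    Each argument is a sequence log-probability: logp_* from the policy
    being tuned, ref_logp_* from the frozen reference model.
    """
    # Implicit reward margins of each response relative to the reference.
    safe_margin = logp_safe - ref_logp_safe
    unsafe_margin = logp_unsafe - ref_logp_unsafe
    margin = beta * (safe_margin - unsafe_margin)
    # -log(sigmoid(margin)): small when the policy prefers the safe response.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the tuned policy assigns relatively more probability to the safe response than the reference does, the margin grows and the loss shrinks, pushing the model toward the culturally safe behavior without drifting far from the reference.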
Problem

Research questions and friction points this paper is trying to address.

Assessing cultural safety in vision-language models globally
Identifying gaps in cultural norm compliance and awareness
Developing strategies to enhance culturally appropriate responses
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces CROSS benchmark for cultural safety evaluation
Proposes CROSS-Eval framework with four key dimensions
Develops fine-tuning and preference tuning enhancement strategies
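As an illustrative sketch of how the four CROSS-Eval dimensions could be rolled up into the percentage figures reported above: per-response judgments might be averaged per dimension. The field names mirror the paper's dimensions; the scorer itself is an assumption, not the authors' implementation:

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class CrossEvalScores:
    """Judged scores for one model response, each in [0, 1]."""
    awareness: float    # recognizes the cultural norm at play
    education: float    # explains the norm to the user
    compliance: float   # avoids endorsing the violation
    helpfulness: float  # still addresses the user's request

def aggregate(scores):
    """Average each dimension over judged responses, as percentages."""
    dims = ("awareness", "education", "compliance", "helpfulness")
    return {d: 100.0 * mean(getattr(s, d) for s in scores) for d in dims}
```

Under this scheme, a benchmark-level figure such as "37.73% compliance" would simply be the mean per-response compliance score scaled to a percentage.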
👥 Authors
Haoyi Qiu (UCLA)
Kung-Hsiang Huang (Salesforce AI Research)
Ruichen Zheng (UCLA)
Jiao Sun (Google DeepMind)
Nanyun Peng (UCLA)