Multi-Agent Multimodal Models for Multicultural Text to Image Generation

📅 2025-02-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing large language models (LLMs) perform poorly on cross-cultural multimodal tasks because the underlying data and models are predominantly Western-centric. To address this, we propose MosAIG, a multi-agent image generation framework in which multiple LLMs, each assigned a distinct cultural persona, collaborate to generate culturally contextualized images. Our key contributions are: (1) a multi-agent framework that uses LLMs with distinct cultural personas to enhance multicultural image generation; (2) a dataset of 9,000 multicultural images spanning five countries, three age groups, two genders, 25 historical landmarks, and five languages; and (3) evidence that multi-agent interactions outperform simple, no-agent models across multiple evaluation metrics. The dataset and models are publicly released.

📝 Abstract
Large Language Models (LLMs) demonstrate impressive performance across various multimodal tasks. However, their effectiveness in cross-cultural contexts remains limited due to the predominantly Western-centric nature of existing data and models. Meanwhile, multi-agent models have shown strong capabilities in solving complex tasks. In this paper, we evaluate the performance of LLMs in a multi-agent interaction setting for the novel task of multicultural image generation. Our key contributions are: (1) We introduce MosAIG, a Multi-Agent framework that enhances multicultural Image Generation by leveraging LLMs with distinct cultural personas; (2) We provide a dataset of 9,000 multicultural images spanning five countries, three age groups, two genders, 25 historical landmarks, and five languages; and (3) We demonstrate that multi-agent interactions outperform simple, no-agent models across multiple evaluation metrics, offering valuable insights for future research. Our dataset and models are available at https://github.com/OanaIgnat/MosAIG.

Problem

Research questions and friction points this paper is trying to address.

Multicultural image generation enhancement
Multi-agent interaction in LLMs
Cross-cultural data diversity expansion

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent framework enhances multicultural image generation
Leverages LLMs with distinct cultural personas
Multi-agent interactions outperform simple, no-agent models
Parth Bhalerao
Santa Clara University - Santa Clara, USA
Mounika Yalamarty
Santa Clara University - Santa Clara, USA
Brian Trinh
Santa Clara University - Santa Clara, USA
Oana Ignat
Assistant Professor of Computer Science at Santa Clara University
AI, Machine Learning, Computer Vision, Natural Language Processing, Mathematics