Multi-Agent Multimodal Models for Multicultural Text to Image Generation

📅 2025-02-21

🤖 AI Summary

Existing large language models (LLMs) have limited effectiveness on cross-cultural multimodal tasks because the data and models they are built on are predominantly Western-centric. To address this, we propose MosAIG, a multi-agent image generation framework in which multiple LLMs, each assigned a distinct cultural persona, collaborate to generate culturally contextualized images. Our key contributions are: (1) a multi-agent framework that improves multicultural image generation through collaboration among culture-persona-guided LLM agents; (2) a dataset of 9,000 multicultural images spanning five countries, three age groups, two genders, 25 historical landmarks, and five languages; and (3) experiments showing that multi-agent interaction outperforms simple, no-agent baselines across multiple evaluation metrics. All models and datasets are publicly released to support more equitable and inclusive multimodal AI research.

📝 Abstract

Large Language Models (LLMs) demonstrate impressive performance across various multimodal tasks. However, their effectiveness in cross-cultural contexts remains limited due to the predominantly Western-centric nature of existing data and models. Meanwhile, multi-agent models have shown strong capabilities in solving complex tasks. In this paper, we evaluate the performance of LLMs in a multi-agent interaction setting for the novel task of multicultural image generation. Our key contributions are: (1) We introduce MosAIG, a Multi-Agent framework that enhances multicultural Image Generation by leveraging LLMs with distinct cultural personas; (2) We provide a dataset of 9,000 multicultural images spanning five countries, three age groups, two genders, 25 historical landmarks, and five languages; and (3) We demonstrate that multi-agent interactions outperform simple, no-agent models across multiple evaluation metrics, offering valuable insights for future research. Our dataset and models are available at https://github.com/OanaIgnat/MosAIG.
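The abstract describes MosAIG only at a high level: several LLM agents, each with a distinct cultural persona, interact to produce a culturally grounded image. The sketch below shows one plausible way to orchestrate such an interaction, assuming a round-robin discussion among persona agents followed by a summarizer that writes the final text-to-image prompt. The agent roles, persona strings, `PersonaAgent`, `run_discussion`, and the injected `llm` callable are all hypothetical illustrations, not the authors' implementation; the linked repository holds the actual code.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical LLM interface: (system_persona, user_message) -> reply text.
# In practice this would wrap a chat-completion API; it is injected here so the
# orchestration logic runs without any model.
LLM = Callable[[str, str], str]


@dataclass
class PersonaAgent:
    name: str
    persona: str  # system prompt giving the agent its (hypothetical) cultural identity

    def contribute(self, llm: LLM, task: str, transcript: List[str]) -> str:
        # Each agent sees the task plus everything said so far.
        history = "\n".join(transcript) or "(none yet)"
        message = (
            f"Task: {task}\n"
            f"Discussion so far:\n{history}\n"
            "Add details from your cultural perspective."
        )
        return llm(self.persona, message)


def run_discussion(
    llm: LLM, agents: List[PersonaAgent], task: str, rounds: int = 2
) -> Tuple[str, List[str]]:
    """Round-robin discussion among persona agents, then a summarizer writes the prompt."""
    transcript: List[str] = []
    for _ in range(rounds):
        for agent in agents:
            reply = agent.contribute(llm, task, transcript)
            transcript.append(f"{agent.name}: {reply}")
    # Assumed final step: condense the whole discussion into one image prompt.
    summarizer = "You merge a discussion into one concise text-to-image prompt."
    prompt = llm(summarizer, f"Task: {task}\n" + "\n".join(transcript))
    return prompt, transcript


if __name__ == "__main__":
    # Deterministic stub standing in for a real LLM: replies with the first
    # sentence of the system persona, so the control flow can be inspected offline.
    def fake_llm(system: str, user: str) -> str:
        return system.split(".")[0]

    agents = [
        PersonaAgent("Nationality", "You are a young woman from Mexico."),
        PersonaAgent("Landmark", "You are an expert on the culture of India."),
    ]
    prompt, transcript = run_discussion(
        fake_llm, agents, "A young Mexican woman visiting the Taj Mahal", rounds=2
    )
    print(len(transcript))  # 4
    print(prompt)  # You merge a discussion into one concise text-to-image prompt
```

Injecting the model as a plain callable keeps the orchestration testable offline; under these assumptions, replacing `fake_llm` with a real chat-completion client and passing the returned prompt to a text-to-image model would complete the pipeline.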

Problem

Research questions and friction points this paper is trying to address.

Improving multicultural image generation

Evaluating multi-agent interaction among LLMs

Expanding the cultural diversity of data

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent framework (MosAIG) that enhances multicultural image generation

Leverages LLMs with distinct cultural personas

Multi-agent interactions outperform simple, no-agent models
