MAVEN: A Multi-Agent Framework for Multicultural Text-to-Video Generation

📅 2026-05-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing text-to-video generation models struggle to accurately represent diverse cultural characteristics within a single prompt. To address this limitation, this work proposes MAVEN, a multi-agent framework that decomposes prompts into three dimensions (person, action, and location) and assigns each to a specialized agent. The agents operate in parallel or sequentially to improve cultural fidelity in both monocultural and cross-cultural scenarios. The work introduces the first multi-agent prompt optimization mechanism tailored for multicultural representation and constructs the first text-to-video evaluation benchmark encompassing Chinese, American, and Romanian cultures. Experimental results show that the parallel specialization strategy significantly improves cultural relevance while preserving visual quality and temporal consistency.

📝 Abstract

Text-to-video (T2V) generation has rapidly progressed in visual fidelity, yet its ability to faithfully represent multiple cultures within a single prompt remains underexplored. We introduce MAVEN, a multi-agent prompt refinement framework designed to improve cultural fidelity in both mono-cultural and cross-cultural T2V generation. MAVEN decomposes prompts into person, action, and location dimensions, handled by specialized agents operating in parallel or sequentially. To support systematic evaluation, we contribute a new benchmark of 243 culturally grounded prompts and 972 corresponding videos, spanning three cultures (Chinese, American, Romanian), three action categories, and both mono-cultural and cross-cultural scenarios. Evaluations combining CLIP-based metrics, VLM-as-judge assessments, and video-quality measures show that multi-agent refinement, particularly parallel specialization, significantly improves cultural relevance while preserving visual quality and temporal consistency. The dataset and code are available at https://github.com/AIM-SCU/CRAFT
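
The two agent arrangements named in the abstract, parallel and sequential, can be illustrated with a minimal Python sketch. This is not the authors' implementation. Every name here (`PromptParts`, `make_stub_agent`, `refine_parallel`, `refine_sequential`) is hypothetical, the agents are plain string-tagging stubs standing in for whatever model calls the real system makes, and assigning one culture per dimension is an assumption about how a cross-cultural prompt might be specified.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, replace
from typing import Callable, Dict

# The three prompt dimensions named in the abstract.
DIMENSIONS = ("person", "action", "location")


@dataclass(frozen=True)
class PromptParts:
    """A prompt split into the three dimensions (hypothetical container)."""
    person: str
    action: str
    location: str

    def render(self) -> str:
        return f"{self.person} {self.action} {self.location}"


# An agent takes the current parts and a target culture and returns
# refined text for its own dimension only.
Agent = Callable[[PromptParts, str], str]


def make_stub_agent(dimension: str) -> Agent:
    """Stand-in for a specialized agent: it only appends a culture tag.

    A real agent would presumably call a language model here (assumption).
    """
    def agent(parts: PromptParts, culture: str) -> str:
        return f"{getattr(parts, dimension)} [{culture} {dimension} details]"
    return agent


def refine_parallel(parts: PromptParts, cultures: Dict[str, str],
                    agents: Dict[str, Agent]) -> PromptParts:
    """Each agent sees only the original parts; the outputs are merged."""
    with ThreadPoolExecutor(max_workers=len(DIMENSIONS)) as pool:
        futures = {d: pool.submit(agents[d], parts, cultures[d])
                   for d in DIMENSIONS}
        return PromptParts(**{d: f.result() for d, f in futures.items()})


def refine_sequential(parts: PromptParts, cultures: Dict[str, str],
                      agents: Dict[str, Agent]) -> PromptParts:
    """Each agent sees the parts as already refined by the earlier agents."""
    current = parts
    for d in DIMENSIONS:
        current = replace(current, **{d: agents[d](current, cultures[d])})
    return current


if __name__ == "__main__":
    # Made-up example prompt; not taken from the benchmark.
    parts = PromptParts("a woman", "preparing a meal", "in a kitchen")
    cultures = {"person": "Chinese", "action": "Chinese", "location": "Romanian"}
    agents = {d: make_stub_agent(d) for d in DIMENSIONS}
    print(refine_parallel(parts, cultures, agents).render())
    print(refine_sequential(parts, cultures, agents).render())
```

With these stubs the two strategies produce identical output, because each stub reads only its own dimension. The arrangements would diverge once an agent conditions on the other dimensions: in the sequential form, later agents see earlier refinements, while in the parallel form every agent works from the original prompt.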

Problem

Research questions and friction points this paper is trying to address.

text-to-video generation

cultural representation

multicultural fidelity

prompt engineering

cross-cultural generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-agent framework

cultural fidelity

text-to-video generation