🤖 AI Summary
Existing text-to-video generation models struggle to accurately represent diverse cultural characteristics within a single prompt. To address this limitation, this work proposes MAVEN, a multi-agent framework that decouples prompts into three distinct dimensions (person, action, and location) and employs specialized agents operating in parallel or sequentially to collaboratively optimize cultural fidelity in both mono-cultural and cross-cultural scenarios. It introduces the first multi-agent prompt optimization mechanism tailored for multicultural representation and constructs the first text-to-video evaluation benchmark encompassing Chinese, American, and Romanian cultures. Experimental results demonstrate that the parallel specialization strategy significantly enhances cultural relevance while preserving high visual quality and temporal consistency.
📝 Abstract
Text-to-video (T2V) generation has rapidly progressed in visual fidelity, yet its ability to faithfully represent multiple cultures within a single prompt remains underexplored. We introduce MAVEN, a multi-agent prompt refinement framework designed to improve cultural fidelity in both mono-cultural and cross-cultural T2V generation. MAVEN decomposes prompts into person, action, and location dimensions, handled by specialized agents operating in parallel or sequentially. To support systematic evaluation, we contribute a new benchmark of 243 culturally grounded prompts and 972 corresponding videos, spanning three cultures (Chinese, American, Romanian), three action categories, and both mono-cultural and cross-cultural scenarios. Evaluations combining CLIP-based metrics, VLM-as-judge assessments, and video quality measures show that multi-agent refinement, particularly parallel specialization, significantly improves cultural relevance while preserving visual quality and temporal consistency. The dataset and code are available at https://github.com/AIM-SCU/CRAFT.
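The person/action/location decomposition and the two scheduling modes can be illustrated with a minimal sketch. This is not the released MAVEN code: the `Prompt` class, the stub agents, and the per-dimension culture mapping are all hypothetical names chosen for illustration, and each stub simply tags its field where a real agent would presumably call a language model.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, replace
from typing import Callable, Dict


@dataclass(frozen=True)
class Prompt:
    """A T2V prompt split into the three dimensions named in the abstract."""
    person: str
    action: str
    location: str

    def render(self) -> str:
        return f"{self.person} {self.action} {self.location}"


# An agent maps (current prompt, target culture) to refined text for ONE dimension.
Agent = Callable[[Prompt, str], str]


def make_stub_agent(dimension: str) -> Agent:
    """Hypothetical stand-in for an LLM-backed agent: it only tags its own field."""
    def agent(prompt: Prompt, culture: str) -> str:
        return f"{getattr(prompt, dimension)} [{culture} {dimension} details]"
    return agent


AGENTS: Dict[str, Agent] = {
    d: make_stub_agent(d) for d in ("person", "action", "location")
}


def refine_parallel(prompt: Prompt, cultures: Dict[str, str]) -> Prompt:
    """Every agent sees the ORIGINAL prompt; the results are merged at the end."""
    with ThreadPoolExecutor(max_workers=len(AGENTS)) as pool:
        futures = {
            d: pool.submit(agent, prompt, cultures[d])
            for d, agent in AGENTS.items()
        }
        return replace(prompt, **{d: f.result() for d, f in futures.items()})


def refine_sequential(prompt: Prompt, cultures: Dict[str, str]) -> Prompt:
    """Each agent sees the prompt as already refined by the agents before it."""
    for d, agent in AGENTS.items():
        prompt = replace(prompt, **{d: agent(prompt, cultures[d])})
    return prompt


if __name__ == "__main__":
    base = Prompt("a man", "drinking tea", "in a courtyard")
    # Assumed cross-cultural setup: one culture per dimension.
    mix = {"person": "Chinese", "action": "American", "location": "Romanian"}
    print(refine_parallel(base, mix).render())
```

With these stubs the two modes give identical output, because each stub reads only its own field. They would diverge once an agent conditions on the other dimensions, which is the case where the choice between parallel and sequential refinement matters.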