NeoBabel: A Multilingual Open Tower for Visual Generation

📅 2025-07-08
🤖 AI Summary
Text-to-image generation has long suffered from English-centric bias, causing semantic distortion, cultural misalignment, and computational redundancy for non-English users. To address this, the authors propose NeoBabel, a framework that combines high-fidelity English generation with robust cross-lingual generalization in multilingual visual synthesis. The method mitigates semantic drift via targeted alignment training, contributes a new multilingual image-text dataset, and establishes standardized evaluation protocols (m-GenEval and m-DPG). NeoBabel couples large-scale multilingual pretraining with high-resolution instruction tuning, achieving state-of-the-art scores of 0.75 on m-GenEval and 0.68 on m-DPG while using 2-4x fewer parameters than mainstream models. The framework is released publicly to advance inclusive, multilingual generative AI research.

📝 Abstract
Text-to-image generation advancements have been predominantly English-centric, creating barriers for non-English speakers and perpetuating digital inequities. While existing systems rely on translation pipelines, these introduce semantic drift, computational overhead, and cultural misalignment. We introduce NeoBabel, a novel multilingual image generation framework that sets a new Pareto frontier in performance, efficiency, and inclusivity, supporting six languages: English, Chinese, Dutch, French, Hindi, and Persian. The model is trained using a combination of large-scale multilingual pretraining and high-resolution instruction tuning. To evaluate its capabilities, we expand two English-only benchmarks to multilingual equivalents: m-GenEval and m-DPG. NeoBabel achieves state-of-the-art multilingual performance while retaining strong English capability, scoring 0.75 on m-GenEval and 0.68 on m-DPG. Notably, it performs on par with leading models on English tasks while outperforming them by +0.11 and +0.09 on multilingual benchmarks, even though those models are built on multilingual base LLMs. This demonstrates the effectiveness of our targeted alignment training for preserving and extending cross-lingual generalization. We further introduce two new metrics to rigorously assess multilingual alignment and robustness to code-mixed prompts. Moreover, NeoBabel matches or exceeds English-only models while being 2-4x smaller. We release an open toolkit, including all code, model checkpoints, a curated dataset of 124M multilingual text-image pairs, and standardized multilingual evaluation protocols, to advance inclusive AI research. Our work demonstrates that multilingual capability is not a trade-off but a catalyst for improved robustness, efficiency, and cultural fidelity in generative AI.
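The multilingual benchmarks (m-GenEval, m-DPG) report per-language results that roll up into a single score. As an illustration only (this is not NeoBabel's released evaluation code, and the scores and aggregation below are hypothetical), a cross-language average might be computed like this:

```python
# Hypothetical aggregation of per-language benchmark scores into one number.
# Illustrative sketch; not the paper's actual evaluation protocol.
def aggregate_multilingual(scores: dict[str, float]) -> float:
    """Average per-language benchmark scores into a single overall score."""
    if not scores:
        raise ValueError("no per-language scores given")
    return sum(scores.values()) / len(scores)

# Example with the six languages NeoBabel supports (values are made up):
per_lang = {
    "en": 0.78,  # English
    "zh": 0.74,  # Chinese
    "nl": 0.75,  # Dutch
    "fr": 0.76,  # French
    "hi": 0.72,  # Hindi
    "fa": 0.73,  # Persian
}
overall = aggregate_multilingual(per_lang)
```

A simple mean weights every language equally, which matches the paper's framing of multilingual capability as a first-class goal rather than an afterthought; a real protocol might also report per-language breakdowns.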
Problem

Research questions and friction points this paper addresses.

Overcoming English-centric bias in text-to-image generation
Reducing semantic drift and computational overhead in multilingual systems
Enhancing cultural fidelity and inclusivity across six languages
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual image generation framework NeoBabel
Combines large-scale pretraining and instruction tuning
Open toolkit with code and multilingual datasets