Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model

📅 2025-03-10
🤖 AI Summary
Contemporary image generation models still misinterpret multilingual prompts, model Chinese cultural semantics poorly, and render text with low fidelity. To address these issues, Seedream 2.0 is presented as a native Chinese-English bilingual foundation model for text-to-image generation. It uses a self-developed bilingual large language model as its text encoder, applies Glyph-Aligned ByT5 for character-level text rendering, and adopts Scaled ROPE so that positional encoding generalizes to resolutions not seen in training. The model is further optimized through multi-phase post-training, including supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). The authors report state-of-the-art performance in prompt-following, aesthetics, text rendering, and structural correctness, along with an outstanding ELO score. Seedream 2.0 can also be adapted into an instruction-based image editing model such as SeedEdit, which balances instruction-following and image consistency.

📝 Abstract
The rapid advancement of diffusion models has catalyzed remarkable progress in the field of image generation. However, prevalent models such as Flux, SD3.5, and Midjourney still grapple with issues like model bias, limited text rendering capabilities, and insufficient understanding of Chinese cultural nuances. To address these limitations, we present Seedream 2.0, a native Chinese-English bilingual image generation foundation model that excels across diverse dimensions. It adeptly handles text prompts in both Chinese and English, supporting bilingual image generation and text rendering. We develop a powerful data system that facilitates knowledge integration, and a caption system that balances accuracy and richness in image descriptions. In particular, Seedream is integrated with a self-developed bilingual large language model as a text encoder, allowing it to learn native knowledge directly from massive data. This enables it to generate high-fidelity images with accurate cultural nuances and aesthetic expressions described in either Chinese or English. In addition, Glyph-Aligned ByT5 is applied for flexible character-level text rendering, while a Scaled ROPE generalizes well to untrained resolutions. Multi-phase post-training optimizations, including SFT and RLHF iterations, further improve the overall capability. Through extensive experimentation, we demonstrate that Seedream 2.0 achieves state-of-the-art performance across multiple aspects, including prompt-following, aesthetics, text rendering, and structural correctness. Furthermore, Seedream 2.0 has been optimized through multiple RLHF iterations to closely align its output with human preferences, as revealed by its outstanding ELO score. Finally, it can be readily adapted to an instruction-based image editing model, such as SeedEdit, with strong editing capability that balances instruction-following and image consistency.
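The abstract reports an ELO score derived from human preferences but does not describe how it is computed. As a rough illustration only, the sketch below uses the textbook Elo update for a single pairwise comparison between two models. The function names, the K-factor of 32, and the starting rating of 1000 are assumptions made for this example, not details taken from the paper.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated ratings after one comparison.

    score_a is 1.0 if A's image was preferred, 0.0 if B's was, 0.5 for a tie.
    k is an assumed step size (hypothetical; the paper does not state one).
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Two models start at an assumed rating of 1000; a rater prefers model A.
model_a, model_b = update_elo(1000.0, 1000.0, score_a=1.0)
```

Repeating this update over many rated image pairs yields a ranking in which a higher rating means a model's outputs were preferred more often.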
Problem

Research questions and friction points this paper is trying to address.

Model bias in existing image generation models
Limited text rendering capabilities, particularly for bilingual Chinese-English text
Insufficient understanding of Chinese cultural nuances
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bilingual image generation with Chinese-English support
Self-developed bilingual large language model integration
Glyph-Aligned ByT5 for character-level text rendering
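To give a feel for why a byte-level encoder suits character-level rendering, the sketch below shows how Chinese and English text both reduce to the same small vocabulary of UTF-8 bytes, which is the input unit the original ByT5 operates on. This is only an illustration of byte-level input: the helper name `utf8_byte_ids` is hypothetical, and how the Glyph-Aligned variant is trained or aligned with glyphs is not described in this summary.

```python
def utf8_byte_ids(text: str) -> list[int]:
    """Map text to its UTF-8 byte values (0-255), one id per byte."""
    return list(text.encode("utf-8"))


# An ASCII letter is a single byte; a common Chinese character is three bytes.
english_ids = utf8_byte_ids("A")
chinese_ids = utf8_byte_ids("猫")

# The mapping is lossless: the bytes decode back to the original string.
restored = bytes(utf8_byte_ids("Seedream 梦")).decode("utf-8")
```

Because every character in either language is covered by the same 256 byte values, no character is out of vocabulary, at the cost of longer input sequences.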
Authors

Lixue Gong
Xiaoxia Hou
Fanshi Li
Liang Li
Xiaochen Lian (ByteDance Research; Machine Learning, Computer Vision)
Fei Liu
Wei Liu
Wei Lu
Yichun Shi (ByteDance; Computer Vision, Machine Learning)
Shiqi Sun
Yu Tian
Zhi Tian
Peng Wang
Xun Wang
Ye Wang
Guofeng Wu
Jie Wu
Xin Xia
Xuefeng Xiao (ByteDance Seed; Computer Vision, Efficient AI)
Linjie Yang (ByteDance Inc.; Computer Vision, Machine Learning)
Zhonghua Zhai (Alibaba; Deep Learning)
Xinyu Zhang
Qi Zhang
Yuwei Zhang
Shijia Zhao
Jianchao Yang
Weilin Huang (ByteDance Seed; Computer Vision, Deep Learning)