Seedream 3.0 Technical Report

📅 2025-04-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Seedream 3.0 addresses key limitations of Seedream 2.0—including inadequate complex prompt alignment, suboptimal Chinese typography rendering, insufficient visual aesthetic modeling, constrained image fidelity, and fixed 1K-resolution output—by introducing the first bilingual (Chinese–English) foundational image generation model natively supporting 2K-resolution output. Methodologically, it proposes a novel defect-aware training paradigm and a dual-axis collaborative data-sampling framework, incorporating a consistent-noise-expectation acceleration mechanism. It further integrates mixed-resolution training, cross-modal RoPE positional encoding, a representation alignment loss, resolution-aware timestep sampling, vision-language model (VLM)-based reward modeling, and supervised fine-tuning (SFT) with aesthetic annotations. Experiments demonstrate substantial improvements: significantly enhanced Chinese text rendering, markedly better alignment with human aesthetic preferences, state-of-the-art image fidelity and typographic professionalism, and 4–8× faster inference.
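The resolution-aware timestep sampling named above is not detailed on this page. A common heuristic in rectified-flow models, shown here only as a hedged sketch, is to shift sampled timesteps toward higher noise for larger images; the `base_shift` value and the square-root-of-pixel-count scaling below are assumptions for illustration, not the report's published schedule.

```python
import math

def shift_timestep(u, height, width, base_res=1024, base_shift=3.0):
    """Map a uniform sample u in (0, 1) to a diffusion timestep biased
    toward higher noise for larger images.

    base_shift and the sqrt(pixel-count) scaling are illustrative
    assumptions, not the schedule used in the report.
    """
    # Larger images retain more low-frequency signal at a given noise
    # level, so the shift pushes t toward 1 (more noise) as resolution grows.
    shift = base_shift * math.sqrt((height * width) / float(base_res ** 2))
    return shift * u / (1.0 + (shift - 1.0) * u)
```

With these assumed constants, a 2048×2048 image maps u = 0.5 to t ≈ 0.86, versus t = 0.75 at the 1024×1024 base resolution, so high-resolution samples spend more training steps in the high-noise regime.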

📝 Abstract
We present Seedream 3.0, a high-performance Chinese-English bilingual image generation foundation model. We develop several technical improvements to address existing challenges in Seedream 2.0, including alignment with complicated prompts, fine-grained typography generation, suboptimal visual aesthetics and fidelity, and limited image resolutions. Specifically, the advancements of Seedream 3.0 stem from improvements across the entire pipeline, from data construction to model deployment. At the data stratum, we double the dataset using a defect-aware training paradigm and a dual-axis collaborative data-sampling framework. Furthermore, we adopt several effective techniques such as mixed-resolution training, cross-modality RoPE, representation alignment loss, and resolution-aware timestep sampling in the pre-training phase. During the post-training stage, we utilize diversified aesthetic captions in SFT, and a VLM-based reward model with scaling, thereby achieving outputs that align well with human preferences. Furthermore, Seedream 3.0 pioneers a novel acceleration paradigm. By employing consistent noise expectation and importance-aware timestep sampling, we achieve a 4 to 8 times speedup while maintaining image quality. Seedream 3.0 demonstrates significant improvements over Seedream 2.0: it enhances overall capabilities, in particular text rendering of complicated Chinese characters, which is important for professional typography generation. In addition, it provides native high-resolution output (up to 2K), allowing it to generate images with high visual quality.
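The mixed-resolution training mentioned in the abstract typically relies on aspect-ratio bucketing, so that every image in a batch shares one shape and can be stacked into a single tensor. The sketch below illustrates that generic idea only; the bucket list and the `assign_bucket` helper are hypothetical, not taken from the report.

```python
def assign_bucket(width, height, buckets):
    """Pick the bucket whose aspect ratio is closest to the image's.

    Images are later resized and cropped to their bucket's exact shape,
    so a mixed-resolution batch still stacks into one tensor.
    """
    ratio = width / height
    return min(buckets, key=lambda wh: abs(wh[0] / wh[1] - ratio))

# Hypothetical ~1-megapixel buckets spanning portrait to landscape.
BUCKETS = [(1024, 1024), (832, 1216), (1216, 832), (768, 1344), (1344, 768)]

bucket = assign_bucket(4000, 3000, BUCKETS)  # a 4:3 photo lands in (1216, 832)
```

In practice a data loader groups samples by assigned bucket and draws each batch from a single bucket, which is what makes training on many resolutions at once feasible.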
Problem

Research questions and friction points this paper is trying to address.

Improves alignment with complex prompts in image generation
Enhances fine-grained typography and text-rendering capabilities
Increases image resolution and visual fidelity significantly
Innovation

Methods, ideas, or system contributions that make the work stand out.

Defect-aware training and dual-axis data sampling
Mixed-resolution training and cross-modality RoPE
Consistent noise expectation for speedup
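Cross-modality RoPE is only named on this page. One generic way to share rotary positions between text and image tokens, sketched below under stated assumptions (the half-split convention and both helper names are illustrative, not the report's design), is to rotate the two halves of each token's features by two position axes: image tokens use their (row, col) grid position, text tokens reuse their 1-D sequence position on both axes.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Standard 1-D rotary embedding: rotate consecutive feature pairs
    of x (last dim even) by angles pos * base**(-2i/d)."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)                # (d/2,)
    angles = np.asarray(pos, dtype=float)[..., None] * freqs
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x, dtype=float)
    out[..., 0::2] = x1 * np.cos(angles) - x2 * np.sin(angles)
    out[..., 1::2] = x1 * np.sin(angles) + x2 * np.cos(angles)
    return out

def embed_token(x, pos_a, pos_b):
    """Rotate each half of a token's features by one position axis.

    Image tokens pass their (row, col) grid position; text tokens pass
    (seq, seq), so both modalities live in one shared positional space.
    """
    d = x.shape[-1] // 2
    return np.concatenate([rope_1d(x[:d], pos_a), rope_1d(x[d:], pos_b)])
```

Because rotations are norm-preserving and relative (the attention dot product depends only on position differences), this style of encoding lets text and image tokens attend to each other without a learned positional table.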
👥 Authors
Yu Gao, Lixue Gong, Qiushan Guo (The University of Hong Kong; ByteDance), Xiaoxia Hou, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian (ByteDance Research), Chao Liao, Liyang Liu, Wei Liu, Yichun Shi (ByteDance), Shiqi Sun, Yu Tian, Zhi Tian, Peng Wang, Rui Wang, Xuanda Wang, Xun Wang, Ye Wang, Guofeng Wu, Jie Wu, Xin Xia, Xuefeng Xiao (ByteDance Seed), Zhonghua Zhai (Alibaba), Xinyu Zhang, Qi Zhang, Yuwei Zhang, Shijia Zhao, Jianchao Yang, Weilin Huang (ByteDance Seed)