Nucleus-Image: Sparse MoE for Image Generation

📅 2026-04-13

🤖 AI Summary

This work presents Nucleus-Image, a sparse Mixture-of-Experts (MoE) diffusion Transformer for text-to-image generation, described by the authors as the first fully open-source MoE diffusion model at this quality level. The model uses Expert-Choice routing and, to address routing instability under timestep modulation, adds a decoupled routing design that separates timestep-aware expert assignment from timestep-conditioned expert computation. The training pipeline combines multi-stage data filtering, a progressive-resolution curriculum (256 to 512 to 1024), and the Muon optimizer. With approximately 2B active parameters per forward pass out of 17B total, the model matches or exceeds leading approaches on GenEval, DPG-Bench, and OneIG-Bench at substantially lower inference cost, without any post-training optimization.

📝 Abstract

We present Nucleus-Image, a text-to-image generation model that establishes a new Pareto frontier in quality-versus-efficiency by matching or exceeding leading models on GenEval, DPG-Bench, and OneIG-Bench while activating only approximately 2B parameters per forward pass. Nucleus-Image employs a sparse mixture-of-experts (MoE) diffusion transformer architecture with Expert-Choice Routing that scales total model capacity to 17B parameters across 64 routed experts per layer. We adopt a streamlined architecture optimized for inference efficiency by excluding text tokens from the transformer backbone entirely and using joint attention that enables text KV sharing across timesteps. To improve routing stability when using timestep modulation, we introduce a decoupled routing design that separates timestep-aware expert assignment from timestep-conditioned expert computation. We construct a large-scale training corpus of 1.5B high-quality training pairs spanning 700M unique images through multi-stage filtering, deduplication, aesthetic tiering, and caption curation. Training follows a progressive resolution curriculum (256 to 512 to 1024) with multi-aspect-ratio bucketing at every stage, coupled with progressive sparsification of the expert capacity factor. We adopt the Muon optimizer and share our parameter grouping recipe tailored for diffusion models with timestep modulation. Nucleus-Image demonstrates that sparse MoE scaling is a highly effective path to high-quality image generation, reaching the performance of models with significantly larger active parameter budgets at a fraction of the inference cost. These results are achieved without post-training optimization of any kind: no reinforcement learning, no direct preference optimization, and no human preference tuning. We release the training recipe, making Nucleus-Image the first fully open-source MoE diffusion model at this quality.
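To make the Expert-Choice Routing idea in the abstract concrete, here is a minimal pure-Python sketch of the general technique: each expert picks a fixed number of tokens, rather than each token picking its experts. It is not taken from the Nucleus-Image code. The function name `expert_choice_route`, the softmax-over-experts scoring, and the capacity formula `capacity_factor * n_tokens / n_experts` are assumptions for illustration, following the common formulation of expert-choice routing. The decoupled routing design and the actual capacity-factor schedule are not specified in the abstract and are not modeled here.

```python
import math


def expert_choice_route(scores, capacity_factor=1.0):
    """Illustrative expert-choice routing (hypothetical helper, not the paper's code).

    scores[t][e] is the router logit of token t for expert e.
    Returns one list per expert of (token_index, gate_weight) pairs.
    """
    n_tokens = len(scores)
    n_experts = len(scores[0])

    # Softmax over experts for each token (assumed scoring rule).
    probs = []
    for row in scores:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        z = sum(exps)
        probs.append([x / z for x in exps])

    # Every expert processes the same number of tokens, so load is balanced
    # by construction. A smaller capacity factor means sparser computation.
    capacity = int(capacity_factor * n_tokens / n_experts)
    capacity = max(1, min(n_tokens, capacity))

    assignments = []
    for e in range(n_experts):
        # The expert ranks all tokens by affinity and keeps its top `capacity`.
        ranked = sorted(range(n_tokens), key=lambda t: probs[t][e], reverse=True)
        assignments.append([(t, probs[t][e]) for t in ranked[:capacity]])
    return assignments


# Toy example: 4 tokens, 2 experts, each expert keeps 2 tokens.
toy_scores = [[2.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 2.0]]
routing = expert_choice_route(toy_scores, capacity_factor=1.0)
```

Because the experts do the selecting, a token may be chosen by several experts or by none. In the general technique, lowering the capacity factor is one way to reduce the compute spent per token, which is how this sketch relates to the progressive sparsification mentioned above.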

Problem

Research questions and friction points this paper is trying to address.

text-to-image generation

quality-efficiency trade-off

sparse mixture-of-experts

diffusion model

inference efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse MoE

Expert-Choice Routing

Diffusion Transformer