Soft-Di[M]O: Improving One-Step Discrete Image Generation with Soft Embeddings

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
While one-step generators distilled from Masked Diffusion Models (MDMs) offer efficiency, they inherit modeling biases from teacher models and suffer from non-differentiability due to discrete token outputs, preventing adversarial refinement, reward-based fine-tuning, and test-time embedding optimization (TTEO). Method: We propose a soft-embedding mechanism that replaces discrete token outputs with continuous distributional expectations, preserving the expressive power of discrete generators while enabling end-to-end differentiability. Contribution/Results: This is the first approach to render one-step generative models compatible with GAN-based refinement, reward-driven fine-tuning, and TTEO. Integrated into the Di[M]O distillation framework, our method achieves a state-of-the-art FID of 1.56 on ImageNet-256. In text-to-image generation, it significantly improves GenEval and HPS scores, with further gains attained via TTEO.

📝 Abstract
One-step generators distilled from Masked Diffusion Models (MDMs) compress multiple sampling steps into a single forward pass, enabling efficient text and image synthesis. However, they suffer two key limitations: they inherit modeling bias from the teacher, and their discrete token outputs block gradient flow, preventing post-distillation refinements such as adversarial training, reward-based fine-tuning, and Test-Time Embedding Optimization (TTEO). In this work, we introduce soft embeddings, a simple relaxation that replaces discrete tokens with the expected embeddings under the generator's output distribution. Soft embeddings preserve representation fidelity for one-step discrete generators while providing a fully differentiable continuous surrogate that is compatible with teacher backbones and tokenizer decoders. Integrating soft embeddings into the Di[M]O distillation framework (denoted Soft-Di[M]O) makes one-step generators end-to-end trainable and enables straightforward application of GAN-based refinement, differentiable reward fine-tuning, and TTEO. Empirically, across multiple MDM teachers (e.g., MaskBit, MaskGen), Soft-Di[M]O achieves state-of-the-art one-step results: improved class-to-image performance, a one-step FID of 1.56 on ImageNet-256 with GAN-based refinement, higher GenEval and HPS scores on text-to-image with reward fine-tuning, and further gains from TTEO.
Problem

Research questions and friction points this paper is trying to address.

Overcoming modeling bias and gradient flow issues in one-step image generators
Enabling post-distillation refinements through differentiable continuous embeddings
Improving one-step discrete image generation performance across multiple metrics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Soft embeddings replace discrete tokens with expected embeddings
Enables end-to-end trainable one-step generators
Supports GAN refinement and differentiable reward fine-tuning
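The core relaxation described above can be sketched as a probability-weighted average over the token embedding table: rather than committing to a discrete argmax token (which blocks gradients), the generator's output distribution is used to form an expected embedding. This is a minimal illustrative sketch, not the paper's implementation; function and variable names are assumptions.

```python
import numpy as np

def soft_embedding(logits, embedding_table):
    """Expected embedding under the generator's output distribution.

    logits:          (seq_len, vocab_size) unnormalized token scores
    embedding_table: (vocab_size, embed_dim) token embedding matrix
    returns:         (seq_len, embed_dim) continuous surrogate embeddings
    """
    # Numerically stable softmax over the vocabulary dimension
    shifted = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=-1, keepdims=True)
    # Probability-weighted average of token embeddings: fully
    # differentiable in logits, unlike a hard argmax lookup
    return probs @ embedding_table
```

When the distribution is sharply peaked on one token, the soft embedding approaches that token's hard embedding, which is why the relaxation preserves representation fidelity while remaining differentiable end to end.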