Training-Free Self-Correction for Multimodal Masked Diffusion Models

📅 2026-02-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal masked diffusion models are prone to error accumulation when updating multiple tokens simultaneously, since early inaccuracies propagate through the rest of the generation process. This work proposes a training-free self-correcting sampling framework that leverages the inherent inductive biases of pretrained masked diffusion models to dynamically refine generations, without modifying model parameters or introducing additional evaluators. Notably, it is the first approach to achieve self-correction without extra training or auxiliary models, and it applies broadly across diverse masked diffusion architectures. Experiments show consistent improvements in generation quality on both text-to-image synthesis and multimodal understanding tasks, while simultaneously reducing the number of required sampling steps.

📝 Abstract
Masked diffusion models have emerged as a powerful framework for text and multimodal generation. However, their sampling procedure updates multiple tokens simultaneously and treats generated tokens as immutable, which may lead to error accumulation when early mistakes cannot be revised. In this work, we revisit existing self-correction methods and identify limitations stemming from additional training requirements or reliance on misaligned likelihood estimates. We propose a training-free self-correction framework that exploits the inductive biases of pre-trained masked diffusion models. Without modifying model parameters or introducing auxiliary evaluators, our method significantly improves generation quality on text-to-image generation and multimodal understanding tasks with reduced sampling steps. Moreover, the proposed framework generalizes across different masked diffusion architectures, highlighting its robustness and practical applicability. Code can be found at https://github.com/huge123/FreeCorrection.
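The paper's exact algorithm is not spelled out in this summary, but the general idea it describes (letting the model revisit already-committed tokens instead of treating them as immutable) can be illustrated with a confidence-based remasking loop. The sketch below is an assumption-laden toy, not the authors' method: `toy_model` is a hypothetical stand-in for a pretrained masked diffusion model, and the remasking rule (reopen the least-confident committed positions each step) is one common way to self-correct without extra training or an auxiliary evaluator.

```python
import numpy as np

MASK = -1  # sentinel id for masked positions (illustrative convention)


def toy_model(tokens, vocab=8, rng=None):
    # Hypothetical stand-in for a pretrained masked diffusion model:
    # returns per-position probability distributions over the vocabulary.
    rng = rng if rng is not None else np.random.default_rng(0)
    probs = rng.random((len(tokens), vocab))
    return probs / probs.sum(axis=1, keepdims=True)


def sample_with_self_correction(length=8, steps=4, remask_frac=0.25, vocab=8):
    """Confidence-based sampling with remasking (illustrative only).

    Each step: (1) score every position with the model, (2) commit the
    greedy prediction at each position, (3) remask the least-confident
    committed tokens so later steps can revise them -- no extra
    training or external evaluator is involved.
    """
    rng = np.random.default_rng(0)
    tokens = np.full(length, MASK)
    for step in range(steps):
        probs = toy_model(tokens, vocab, rng)
        conf = probs.max(axis=1)               # model's own confidence signal
        tokens = probs.argmax(axis=1)          # commit greedy predictions
        if step < steps - 1:                   # final step keeps everything
            k = max(1, int(remask_frac * length))
            worst = np.argsort(conf)[:k]       # least-confident positions
            tokens[worst] = MASK               # reopen them for revision
    return tokens


out = sample_with_self_correction()
```

In a real model, the confidence scores would come from the pretrained network's own token probabilities, which is what makes this style of correction training-free: no parameters change and no second model judges the samples.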
Problem

Research questions and friction points this paper is trying to address.

masked diffusion models
error accumulation
self-correction
multimodal generation
immutable tokens
Innovation

Methods, ideas, or system contributions that make the work stand out.

training-free
self-correction
masked diffusion models
multimodal generation
inductive bias