MADI: Masking-Augmented Diffusion with Inference-Time Scaling for Visual Editing

📅 2025-07-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Diffusion models excel at text-to-image generation but show weak controllability and semantic inconsistency in grounded local editing and compositional structure control. To address these limitations, we propose Masking-Augmented Diffusion with Inference-Time Scaling (MADI), a framework that improves editability, compositionality, and controllability. Its training strategy, Masking-Augmented gaussian Diffusion (MAgD), uses a dual corruption process that masks the noisy input from the forward process, combining denoising score matching with masked reconstruction to strengthen local-global semantic alignment. At inference time, MADI scales computational capacity through Pause Tokens, special placeholders inserted into the prompt. MADI combines four key components: masked reconstruction, denoising score matching, context-aware generation modeling, and training with dense prompts. Extensive experiments demonstrate significant improvements in text-image alignment accuracy, local edit fidelity, and compositional generation consistency. Our method achieves state-of-the-art performance across multiple benchmarks, advancing diffusion models toward structure-aware, context-adaptive, general-purpose generation architectures.

📝 Abstract

Despite the remarkable success of diffusion models in text-to-image generation, their effectiveness in grounded visual editing and compositional control remains limited. Motivated by advances in self-supervised learning and in-context generative modeling, we propose a series of simple yet powerful design choices that significantly enhance the capacity of diffusion models for structured, controllable generation and editing. We introduce Masking-Augmented Diffusion with Inference-Time Scaling (MADI), a framework that improves the editability, compositionality, and controllability of diffusion models through two core innovations. First, we introduce Masking-Augmented gaussian Diffusion (MAgD), a novel training strategy with a dual corruption process that combines standard denoising score matching with masked reconstruction by masking the noisy input from the forward process. MAgD encourages the model to learn discriminative and compositional visual representations, thus enabling localized and structure-aware editing. Second, we introduce an inference-time capacity scaling mechanism based on Pause Tokens, which act as special placeholders inserted into the prompt to increase computational capacity at inference time. Our findings also show that adopting expressive and dense prompts during training further enhances performance, particularly for MAgD. Together, these contributions in MADI substantially enhance the editability of diffusion models, paving the way toward their integration into more general-purpose, in-context generative diffusion architectures.
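
The dual corruption process described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `dual_corrupt`, the patch representation (lists of floats), the single `alpha_bar` noise level, the `mask_ratio` parameter, and the use of an all-zero vector as the mask placeholder are all assumptions made for the example. It only shows the ordering stated in the abstract: Gaussian noise from the forward process is applied first, and then part of the noisy input is masked.

```python
import math
import random


def dual_corrupt(patches, alpha_bar, mask_ratio, rng):
    """Noise every patch, then mask a random subset of the noisy patches.

    patches    : list of patches, each a list of floats (assumed layout)
    alpha_bar  : cumulative signal level in (0, 1) for the chosen timestep
    mask_ratio : fraction of patches to replace with a placeholder (assumed)
    rng        : random.Random instance, for reproducibility
    """
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)

    # Corruption 1: standard Gaussian forward process on every patch.
    noise = [[rng.gauss(0.0, 1.0) for _ in patch] for patch in patches]
    noisy = [
        [signal_scale * x + noise_scale * e for x, e in zip(patch, eps)]
        for patch, eps in zip(patches, noise)
    ]

    # Corruption 2: mask a random subset of the *noisy* patches.
    num_masked = round(mask_ratio * len(patches))
    masked_idx = set(rng.sample(range(len(patches)), num_masked))
    mask = [i in masked_idx for i in range(len(patches))]

    # Hypothetical placeholder: an all-zero patch stands in for a mask token.
    corrupted = [
        [0.0] * len(patch) if is_masked else patch
        for patch, is_masked in zip(noisy, mask)
    ]
    return corrupted, noise, mask


if __name__ == "__main__":
    rng = random.Random(0)
    clean = [[float(i), float(i) + 0.5] for i in range(8)]
    corrupted, noise, mask = dual_corrupt(clean, alpha_bar=0.5, mask_ratio=0.5, rng=rng)
    print(sum(mask), "of", len(mask), "patches masked")
```

In a training loop, a denoiser would receive `corrupted` and be asked to recover the noise (or the clean signal) for all patches, including the masked ones; how the two objectives are weighted is not stated in the text above and is left out of this sketch.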

Problem

Research questions and friction points this paper is trying to address.

Enhancing diffusion models for structured, controllable generation and editing

Improving editability and compositionality via Masking-Augmented gaussian Diffusion

Scaling inference-time capacity using Pause Tokens for better performance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Masking-Augmented gaussian Diffusion for structured editing

Inference-time scaling with Pause Tokens

Dual corruption process combining denoising and masking
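
As a rough illustration of the Pause Token idea, the sketch below extends a tokenized prompt with placeholder tokens before it would be passed to a text-conditioned model. The token string `<pause>`, the helper name `add_pause_tokens`, and the choice to append the placeholders at the end of the prompt are assumptions for this example; the text above says only that the placeholders are inserted into the prompt to increase computational capacity at inference time.

```python
def add_pause_tokens(prompt_tokens, num_pause, pause_token="<pause>"):
    """Return a new token list with `num_pause` placeholder tokens appended.

    The placeholder string and the append position are hypothetical; a real
    system would use whatever special token its tokenizer was trained with.
    """
    if num_pause < 0:
        raise ValueError("num_pause must be non-negative")
    return list(prompt_tokens) + [pause_token] * num_pause


if __name__ == "__main__":
    tokens = ["a", "red", "cube", "on", "a", "table"]
    # More pause tokens give the model a longer conditioning sequence to attend over.
    print(add_pause_tokens(tokens, 3))
```

Because the number of placeholders is chosen per call, the same trained model could be run with a larger or smaller conditioning sequence depending on how demanding the edit is.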

🔎 Similar Papers

Streamlining Image Editing with Layered Diffusion Brushes