Delta Score Matters! Spatial Adaptive Multi Guidance in Diffusion Models

📅 2026-04-29

🤖 AI Summary

This work addresses a fundamental limitation of standard Classifier-Free Guidance (CFG). CFG applies a single global scalar guidance weight. Because this amounts to a tangential linear extrapolation on a curved data manifold, it introduces an orthogonal (off-manifold) deviation, which makes it hard to balance detail preservation against artifact suppression. Taking a differential-geometric perspective, the paper first analyzes the root cause of this shortcoming. It then proposes SAMG, a training-free, nearly zero-overhead, spatially adaptive multi-guidance mechanism grounded in Tweedie's formula. SAMG dynamically computes a per-pixel conditional guidance energy and assigns region-specific guidance scales: a conservative minimum scale in high-energy boundary regions and an aggressive maximum scale in low-energy regions. Experiments on diverse image and video diffusion models show that SAMG consistently improves semantic alignment and structural integrity, and improves temporal coherence for video. This effectively resolves the long-standing trade-off between fine detail and generation artifacts.
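The "tangential linear extrapolation" claim can be made concrete in a few lines. The derivation below assumes the common DDPM epsilon-prediction parameterization, with forward process x_t = sqrt(ᾱ_t) x_0 + sqrt(1 - ᾱ_t) ε. The paper's own notation may differ. Substituting the CFG-combined noise prediction into Tweedie's formula shows that CFG moves the denoised estimate along the straight line through the unconditional and conditional estimates.

```latex
% Tweedie estimate of the clean sample under condition c (c = \varnothing: unconditional)
\hat{x}_0(x_t, c) = \frac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t, c)}{\sqrt{\bar\alpha_t}}

% Standard CFG with one global scalar w and the "delta" prediction \Delta\epsilon
\tilde{\epsilon} = \epsilon_\theta(x_t, \varnothing) + w\,\Delta\epsilon,
\qquad \Delta\epsilon = \epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)

% Substituting \tilde{\epsilon} into Tweedie's formula gives a linear extrapolation
\hat{x}_0^{\mathrm{cfg}} = \hat{x}_0(x_t, \varnothing)
  + w\,\bigl[\hat{x}_0(x_t, c) - \hat{x}_0(x_t, \varnothing)\bigr]

% Replacing the scalar w by a per-pixel map W (element-wise product \odot)
% keeps the same affine form, but the step length now varies per location
\hat{x}_0^{\mathrm{spatial}} = \hat{x}_0(x_t, \varnothing)
  + W \odot \bigl[\hat{x}_0(x_t, c) - \hat{x}_0(x_t, \varnothing)\bigr]
```

For w > 1 the estimate lies beyond the conditional estimate on that straight line. This is the linear step that the paper argues leaves a curved data manifold. A spatial map W lets the step shrink where that deviation is most harmful.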

📝 Abstract

Diffusion models have achieved remarkable success in synthesizing complex static and temporal visuals, a breakthrough largely driven by Classifier-Free Guidance (CFG). However, despite its pivotal role in aligning generated content with textual prompts, standard CFG relies on a single, globally uniform guidance scale. This homogeneous amplification traps models in a well-documented "detail-artifact dilemma": low guidance scales fail to inject intricate semantics, while high scales inevitably cause structural degradation, color over-saturation, and temporal inconsistencies in videos. In this paper, we trace the root cause of this flaw through the lens of differential geometry. By analyzing Tweedie's formula, we reveal that CFG intrinsically performs a tangential linear extrapolation. Because the natural data manifold is highly curved, this uniform linear step introduces a severe orthogonal deviation. To keep the generation trajectory safely bounded, we formulate a theoretical upper bound for spatially adaptive guidance. Based on these geometric insights, we propose Spatial Adaptive Multi Guidance (SAMG), a training-free and virtually zero-cost sampling algorithm. SAMG dynamically computes point-wise conditional guidance energy. It applies a conservative minimum scale to high-energy boundary regions to preserve delicate micro-textures, and an aggressive maximum scale to low-energy regions to maximize semantic injection. Extensive experiments across diverse image (SD 1.5, SDXL, SD3.5 Medium) and video (CogVideoX, ModelScope) architectures demonstrate that SAMG effectively resolves the detail-artifact dilemma. It achieves superior semantic alignment, structural integrity, and temporal smoothness without additional computational overhead.
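The abstract describes the SAMG rule only qualitatively. The sketch below is one possible reading, not the authors' implementation. Several choices are assumptions made for illustration:

- the hypothetical function name samg_style_guidance;
- the squared-L2 "energy" of the delta prediction over channels;
- the min-max normalization and linear interpolation between scales;
- the default values w_min=3.0 and w_max=9.0.

The paper's theoretical upper bound is not modelled here.

```python
import numpy as np


def samg_style_guidance(eps_uncond, eps_cond, w_min=3.0, w_max=9.0, eps=1e-12):
    """Hypothetical spatially adaptive CFG sketch (not the paper's exact rule).

    eps_uncond, eps_cond: noise predictions with shape (C, H, W).
    Returns (guided prediction of shape (C, H, W), scale map of shape (H, W)).
    """
    # "Delta score": conditional minus unconditional prediction.
    delta = eps_cond - eps_uncond
    # Assumed per-pixel guidance energy: squared L2 norm of delta over channels.
    energy = np.sum(delta ** 2, axis=0)
    # Min-max normalize energy to [0, 1]; eps avoids division by zero on flat maps.
    alpha = (energy - energy.min()) / (energy.max() - energy.min() + eps)
    # High energy -> conservative w_min; low energy -> aggressive w_max.
    scale = w_max - alpha * (w_max - w_min)
    # Broadcast the (H, W) scale map across channels and apply CFG per pixel.
    guided = eps_uncond + scale[None, :, :] * delta
    return guided, scale


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    eps_u = rng.standard_normal((4, 8, 8))
    eps_c = eps_u + 0.1 * rng.standard_normal((4, 8, 8))
    guided, scale = samg_style_guidance(eps_u, eps_c)
    print(guided.shape, scale.shape)
```

Keeping every scale inside [w_min, w_max] mirrors the abstract's conservative-minimum and aggressive-maximum description. Setting w_min equal to w_max collapses the map to a constant, which recovers standard CFG with a single global scale.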

Problem

Research questions and friction points this paper is trying to address.

Classifier-Free Guidance

detail-artifact dilemma

diffusion models

spatial adaptation

guidance scale

Innovation

Methods, ideas, or system contributions that make the work stand out.

Spatial Adaptive Guidance

Classifier-Free Guidance

Diffusion Models