🤖 AI Summary
This work addresses the high computational cost and degraded reconstruction quality of vision-language latent diffusion models applied to inverse problems, both of which stem from the large number of neural function evaluations (NFEs) and the reliance on backpropagation through the autoencoder. The authors propose a unified Euclidean-Wasserstein-2 gradient-flow framework that jointly optimizes the prompt and performs posterior sampling in the latent space through a single flow, aligning the prior and posterior with the observed data. With added consistency regularization, the method supports few-step inference without autoencoder backpropagation, described as a first in the field. By combining few-step latent text-to-image generation with latent-space optimization, the approach achieves state-of-the-art reconstruction quality on several canonical imaging inverse problems while substantially reducing computational overhead.
📝 Abstract
Vision-Language Latent Diffusion Models (LDMs) (Rombach et al., 2022) provide powerful generative priors for inverse problems. However, existing LDM-based inverse solvers typically require a large number of neural function evaluations (NFEs) and backpropagation through large pretrained components, leading to substantial computational costs and, in some cases, degraded reconstruction quality. We propose a unified Euclidean-Wasserstein-2 gradient-flow framework that jointly performs posterior sampling and prompt optimization in the latent space through a single flow that aligns the prior and posterior with the observed data. Combined with few-step latent text-to-image models, this formulation enables low-NFE inference without backpropagation through autoencoders. Experiments across several canonical imaging inverse problems show that our method achieves state-of-the-art performance with significantly reduced computational cost.
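To make the idea of a joint Euclidean-Wasserstein-2 flow concrete, the following is a minimal toy sketch and not the paper's algorithm. It assumes a hypothetical scalar model in which the prior is z ~ N(c, 1), with c standing in for a prompt parameter, and the observation is y = a·z plus Gaussian noise of standard deviation sigma. The parameter c follows a Euclidean gradient step, while a cloud of particles follows unadjusted Langevin steps, a standard discretization of a Wasserstein-2 gradient flow toward the posterior. All names (`joint_flow`, `a`, `sigma`, `step`) are illustrative assumptions; no diffusion model or autoencoder is involved.

```python
import random


def joint_flow(y, a=2.0, sigma=0.5, n_particles=200, n_steps=1500,
               step=0.02, seed=0):
    """Toy particle discretization of a joint Euclidean/Wasserstein-2 flow.

    Hypothetical model (an assumption, not the paper's setup):
        prior        z ~ N(c, 1)            (c plays the role of a prompt)
        observation  y = a * z + N(0, sigma^2)
    so that log p(z, y | c) = -(y - a z)^2 / (2 sigma^2) - (z - c)^2 / 2 + const.
    """
    rng = random.Random(seed)
    c = 0.0                                               # initial "prompt"
    z = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]  # initial particles
    noise_scale = (2.0 * step) ** 0.5

    for _ in range(n_steps):
        # Euclidean part: d/dc log p(z, y | c) = z - c, averaged over particles.
        grad_c = sum(zi - c for zi in z) / n_particles

        # Wasserstein-2 part: one unadjusted Langevin step per particle,
        # using d/dz log p(z, y | c) = a (y - a z) / sigma^2 - (z - c).
        z = [
            zi
            + step * (a * (y - a * zi) / sigma ** 2 - (zi - c))
            + noise_scale * rng.gauss(0.0, 1.0)
            for zi in z
        ]

        # Apply the parameter update computed from the pre-step particles.
        c += step * grad_c

    return c, z


if __name__ == "__main__":
    c_hat, particles = joint_flow(y=3.0)
    print("estimated c:", c_hat)
    print("particle mean:", sum(particles) / len(particles))
```

In this toy model the marginal likelihood of y is maximized at c = y / a, and the posterior mean of z at that value is also y / a, so both the learned parameter and the particle mean should settle near that point up to Monte Carlo noise. The sketch only illustrates how parameter optimization and posterior sampling can share one flow; the few-step text-to-image prior described in the abstract is outside its scope.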