Spatial Gram Alignment for Ultra-High-Resolution Image Synthesis

📅 2026-05-20

🤖 AI Summary
Existing approaches to ultra-high-resolution image generation based on pretrained Latent Diffusion Models (LDMs) struggle to preserve global structure and fine detail at the same time, because direct patch-wise feature distillation perturbs the pretrained latent manifold. To address this, the work proposes Spatial Gram Alignment (SGA), a framework that aligns the self-similarity structure of internal LDM features with that of vision foundation models such as SAM and DINO, rather than matching features patch by patch, so the native latent space is left unperturbed. The spatial constraint supplies macro-structural consistency, while the native generative objective retains micro-detail fidelity. The approach applies to both intermediate diffusion features and VAE latents, and the authors report state-of-the-art performance on ultra-high-resolution text-to-image synthesis.
📝 Abstract
Modern ultra-high-resolution image synthesis relies heavily on the robust generative capacity of large-scale pre-trained Latent Diffusion Models (LDMs). While recent representation alignment methods have proven effective by distilling visual priors from foundation models (e.g., SAM or DINO) into generative latent features, scaling these approaches to pre-trained LDMs at extreme resolutions exposes a critical learnability-fidelity conflict. Specifically, forcing direct patch-wise feature distillation inherently perturbs the pre-trained latent manifold, ultimately leading to generation degradation. To address this bottleneck, we propose Spatial Gram Alignment (SGA), a novel framework that explicitly leverages the representation priors of vision foundation models while preserving the native generative capacity of LDMs. Moving beyond restrictive direct alignment, SGA imposes a non-invasive spatial constraint by aligning the internal self-similarities of the generative features with those of the foundation priors. This spatial constraint effectively establishes macroscopic structural coherence, while the native generative objectives retain the microscopic pixel-level fidelity inherent to the original LDMs. Notably, this versatile strategy integrates seamlessly across both intermediate diffusion features and VAE latents within pre-trained LDMs. Extensive experiments demonstrate that SGA achieves state-of-the-art performance for ultra-high-resolution text-to-image synthesis, yielding an effective reconciliation between global structural integrity and fine-grained visual details. Code is available at https://github.com/zhang0jhon/SGA.
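The idea of aligning self-similarities rather than per-patch features can be illustrated with a minimal sketch. This is not the authors' implementation: the cosine normalization, the mean-squared-error form of the loss, and the names `spatial_gram` and `gram_alignment_loss` are all assumptions made for illustration, and the toy feature lists stand in for real LDM and foundation-model activations.

```python
import math

def spatial_gram(features):
    """Cosine self-similarity matrix over spatial tokens.

    features: list of token vectors, one per spatial location.
    Returns a (num_tokens x num_tokens) nested list.
    """
    unit = []
    for vec in features:
        # Guard against all-zero tokens by falling back to a norm of 1.
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        unit.append([x / norm for x in vec])
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]

def gram_alignment_loss(gen_features, prior_features):
    """Mean squared difference between two self-similarity matrices.

    Assumed loss form; the channel widths of the two inputs may differ,
    but both must cover the same number of spatial tokens.
    """
    if len(gen_features) != len(prior_features):
        raise ValueError("feature maps must have the same number of tokens")
    g = spatial_gram(gen_features)
    p = spatial_gram(prior_features)
    n = len(g)
    return sum((g[i][j] - p[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)

def rotate_2d(features, angle):
    """Rotate every 2-D token vector by the same angle."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c * x - s * y, s * x + c * y] for x, y in features]

# Toy stand-in for foundation-model features: 3 tokens, 2 channels.
prior = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Rotated features differ from the prior patch by patch but share its Gram matrix.
loss_rotated = gram_alignment_loss(rotate_2d(prior, 0.7), prior)

# Features with a different channel width and scale can still match the prior's structure.
loss_wider = gram_alignment_loss([[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [1.0, 1.0, 0.0]], prior)

# Collapsed features (all tokens identical) lose the spatial structure and are penalized.
loss_collapsed = gram_alignment_loss([[1.0, 0.0]] * 3, prior)
```

In this toy setup the rotated and wider features incur essentially no penalty even though none of their token vectors equals the corresponding prior vector, which is the sense in which a Gram-style constraint is less restrictive than direct patch-wise distillation. It also needs no projection layer between the two feature spaces, since only token-to-token similarities are compared.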
Problem

Research questions and friction points this paper is trying to address.

ultra-high-resolution image synthesis
representation alignment
Latent Diffusion Models
feature distillation
learnability-fidelity conflict
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spatial Gram Alignment
Latent Diffusion Models
Representation Alignment
Ultra-High-Resolution Synthesis
Vision Foundation Models