Spatial Gram Alignment for Ultra-High-Resolution Image Synthesis

📅 2026-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Representation alignment methods that distill features from vision foundation models into pretrained Latent Diffusion Models (LDMs) struggle at ultra-high resolutions to preserve both global structure and fine detail, because direct patch-wise feature distillation perturbs the pretrained latent manifold. To address this, the work proposes Spatial Gram Alignment (SGA), a framework that aligns the self-similarity structure of internal LDM features with that of vision foundation models such as SAM and DINO, without forcing a patch-by-patch feature match. The Gram constraint supplies macro-structural consistency, while the native generative objective retains micro-detail fidelity. The approach applies to both intermediate diffusion features and VAE latents, and the authors report state-of-the-art performance on ultra-high-resolution text-to-image synthesis.

📝 Abstract

Modern ultra-high-resolution image synthesis relies heavily on the robust generative capacity of large-scale pre-trained Latent Diffusion Models (LDMs). While recent representation alignment methods have proven effective by distilling visual priors from foundation models (e.g., SAM or DINO) into generative latent features, scaling these approaches to pre-trained LDMs at extreme resolutions exposes a critical learnability-fidelity conflict. Specifically, forcing direct patch-wise feature distillation inherently perturbs the pre-trained latent manifold, ultimately leading to generation degradation. To address this bottleneck, we propose Spatial Gram Alignment (SGA), a novel framework that explicitly leverages the representation priors of vision foundation models while preserving the native generative capacity of LDMs. Moving beyond restrictive direct alignment, SGA imposes a non-invasive spatial constraint by aligning the internal self-similarities of the generative features with those of the foundation priors. This spatial constraint effectively establishes macroscopic structural coherence, while the native generative objectives retain the microscopic pixel-level fidelity inherent to the original LDMs. Notably, this versatile strategy integrates seamlessly across both intermediate diffusion features and VAE latents within pre-trained LDMs. Extensive experiments demonstrate that SGA achieves state-of-the-art performance for ultra-high-resolution text-to-image synthesis, yielding an effective reconciliation between global structural integrity and fine-grained visual details. Code is available at https://github.com/zhang0jhon/SGA.
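
The core idea in the abstract, aligning the internal self-similarities of generative features with those of a foundation prior, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not give the exact loss, so the cosine-normalized Gram matrix, the mean-squared-error penalty, and the names `spatial_gram` and `gram_alignment_loss` are all assumptions made for illustration. The feature arrays are random stand-ins for LDM and foundation-model token features.

```python
import numpy as np


def spatial_gram(features: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Cosine self-similarity (Gram) matrix over spatial tokens.

    features: array of shape (N, C), i.e. N spatial tokens with C channels.
    Returns an (N, N) matrix whose entry (i, j) is the cosine similarity
    between token i and token j.
    """
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, eps)  # guard against zero vectors
    return unit @ unit.T


def gram_alignment_loss(gen_feats: np.ndarray, prior_feats: np.ndarray) -> float:
    """Mean squared difference between two spatial Gram matrices.

    Assumption: both inputs share the same token count N (same spatial grid).
    The channel widths may differ, since only the (N, N) matrices are compared.
    """
    g_gen = spatial_gram(gen_feats)
    g_prior = spatial_gram(prior_feats)
    return float(np.mean((g_gen - g_prior) ** 2))


rng = np.random.default_rng(0)
gen = rng.normal(size=(16, 32))    # stand-in for LDM features: 16 tokens, 32 channels
prior = rng.normal(size=(16, 64))  # stand-in for foundation features: 16 tokens, 64 channels

# Unrelated features give a positive loss even though channel widths differ.
loss_random = gram_alignment_loss(gen, prior)

# An orthogonal change of channel basis leaves all token-to-token similarities
# intact, so the loss is numerically zero although the raw features differ.
q, _ = np.linalg.qr(rng.normal(size=(32, 32)))
loss_rotated = gram_alignment_loss(gen, gen @ q)

print(f"random prior loss:  {loss_random:.6f}")
print(f"rotated prior loss: {loss_rotated:.2e}")
```

The rotated case hints at why such a constraint is less intrusive than direct patch-wise distillation: the loss only fixes how spatial locations relate to one another, so the generative features remain free to keep their own channel-space representation.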

Problem

Research questions and friction points this paper is trying to address.

ultra-high-resolution image synthesis

representation alignment

Latent Diffusion Models

feature distillation

learnability-fidelity conflict

Innovation

Methods, ideas, or system contributions that make the work stand out.

Spatial Gram Alignment

Latent Diffusion Models

Representation Alignment