REGLUE Your Latents with Global and Local Semantics for Entangled Diffusion

📅 2025-12-18
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Latent Diffusion Models (LDMs) suffer from weak semantic supervision, slow training convergence, and limited generation quality, primarily due to the reconstruction-based denoising objective's lack of explicit, rich high-level semantic guidance. Method: We propose REGLUE, the first framework jointly modeling VAE latent variables, multi-layer visual foundation model (VFM) spatial features, and global [CLS] tokens. It introduces a lightweight nonlinear convolutional semantic compressor to efficiently fuse multi-scale VFM spatial semantics and incorporates an external alignment loss with frozen VFMs to regularize the latent space. Contribution/Results: On ImageNet 256×256, REGLUE significantly outperforms SiT-B/2, SiT-XL/2, and prior methods (REPA, ReDi, REG), achieving lower FID scores and faster convergence. This validates the effectiveness of co-modeling global, local, and latent representations alongside nonlinear semantic compression.

πŸ“ Abstract
Latent diffusion models (LDMs) achieve state-of-the-art image synthesis, yet their reconstruction-style denoising objective provides only indirect semantic supervision: high-level semantics emerge slowly, requiring longer training and limiting sample quality. Recent works inject semantics from Vision Foundation Models (VFMs) either externally via representation alignment or internally by jointly modeling only a narrow slice of VFM features inside the diffusion process, under-utilizing the rich, nonlinear, multi-layer spatial semantics available. We introduce REGLUE (Representation Entanglement with Global-Local Unified Encoding), a unified latent diffusion framework that jointly models (i) VAE image latents, (ii) compact local (patch-level) VFM semantics, and (iii) a global (image-level) [CLS] token within a single SiT backbone. A lightweight convolutional semantic compressor nonlinearly aggregates multi-layer VFM features into a low-dimensional, spatially structured representation, which is entangled with the VAE latents in the diffusion process. An external alignment loss further regularizes internal representations toward frozen VFM targets. On ImageNet 256×256, REGLUE consistently improves FID and accelerates convergence over SiT-B/2 and SiT-XL/2 baselines, as well as over REPA, ReDi, and REG. Extensive experiments show that (a) spatial VFM semantics are crucial, (b) non-linear compression is key to unlocking their full benefit, and (c) global tokens and external alignment act as complementary, lightweight enhancements within our global-local-latent joint modeling framework. The code is available at https://github.com/giorgospets/reglue.
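The "lightweight convolutional semantic compressor" described above can be pictured as a small conv net that fuses stacked multi-layer VFM patch features into a low-dimensional spatial map, which is then concatenated channel-wise with the VAE latents for joint diffusion. The sketch below is a hypothetical illustration of that idea; the class name, layer widths, and all tensor shapes are assumptions, not taken from the paper or its released code.

```python
# Hypothetical sketch of a multi-layer VFM semantic compressor.
# Dimensions (4 VFM layers, 768-dim features, 16x16 patch grid,
# 4-channel VAE latent) are illustrative assumptions only.
import torch
import torch.nn as nn

class SemanticCompressor(nn.Module):
    """Nonlinearly fuse multi-layer VFM patch features into a compact,
    spatially structured map that can be entangled with VAE latents."""
    def __init__(self, n_layers=4, vfm_dim=768, out_dim=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(n_layers * vfm_dim, 256, kernel_size=3, padding=1),
            nn.SiLU(),  # nonlinearity: this is what "nonlinear compression" buys
            nn.Conv2d(256, out_dim, kernel_size=1),
        )

    def forward(self, feats):
        # feats: list of (B, H*W, vfm_dim) patch features from several VFM layers
        B, N, D = feats[0].shape
        H = W = int(N ** 0.5)
        # reshape token sequences into feature maps, stack along channels
        x = torch.cat([f.transpose(1, 2).reshape(B, D, H, W) for f in feats], dim=1)
        return self.net(x)  # (B, out_dim, H, W)

# Entangle compressed semantics with VAE latents along the channel axis
compressor = SemanticCompressor()
feats = [torch.randn(2, 256, 768) for _ in range(4)]  # 4 VFM layers, 16x16 patches
z_vae = torch.randn(2, 4, 16, 16)           # VAE latent (assumed shape)
z_sem = compressor(feats)                   # (2, 8, 16, 16)
z_joint = torch.cat([z_vae, z_sem], dim=1)  # diffusion would operate on this
print(z_joint.shape)  # torch.Size([2, 12, 16, 16])
```

The key design point the abstract emphasizes is that the compression is nonlinear (conv + activation) rather than a single linear projection, which it credits for unlocking the full benefit of multi-layer spatial semantics.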
Problem

Research questions and friction points this paper is trying to address.

Improves latent diffusion models by integrating multi-layer VFM semantics
Addresses slow semantic emergence and limited sample quality in LDMs
Unifies VAE latents with global and local VFM features for better synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Jointly models VAE latents, local patch semantics, and global CLS tokens
Uses nonlinear convolutional compressor for multi-layer VFM feature aggregation
Combines internal entanglement with external alignment loss for regularization
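The external alignment loss in the last bullet is described as regularizing the diffusion model's internal representations toward frozen VFM targets. A common way to realize such a regularizer (used in REPA-style alignment) is a cosine-similarity loss between projected diffusion features and detached VFM features; the sketch below assumes that formulation, and all names and dimensions are illustrative.

```python
# Hypothetical REPA-style alignment regularizer toward a frozen VFM.
# Dimensions (1152-dim diffusion hidden states, 768-dim VFM features)
# are assumptions for illustration.
import torch
import torch.nn.functional as F

def alignment_loss(h_diff, h_vfm, proj):
    """1 - mean cosine similarity between projected internal diffusion
    features and frozen VFM patch features."""
    h = F.normalize(proj(h_diff), dim=-1)      # project into VFM feature space
    t = F.normalize(h_vfm.detach(), dim=-1)    # frozen target: no gradient flows back
    return 1.0 - (h * t).sum(dim=-1).mean()

proj = torch.nn.Linear(1152, 768)   # trainable projection head (assumed dims)
h_diff = torch.randn(2, 256, 1152)  # per-patch features from the diffusion backbone
h_vfm = torch.randn(2, 256, 768)    # per-patch features from the frozen VFM
loss = alignment_loss(h_diff, h_vfm, proj)
```

Because the VFM is frozen and its features are detached, only the diffusion backbone and the projection head receive gradients, so the term acts purely as a regularizer on the internal representations.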