DecQ: Detail-Condensing Queries for Enhanced Reconstruction and Generation in Representation Autoencoders

📅 2026-05-21

🤖 AI Summary
This work addresses a trade-off in representation autoencoders (RAEs) built on frozen vision foundation models: freezing the encoder limits spatial reconstruction, while fine-tuning it for reconstruction disrupts the pretrained semantic space and degrades generation quality. The authors propose DecQ, a framework that uses lightweight detail-condensing queries to extract fine-grained information from intermediate features of a frozen DINOv2 model and feeds it into the decoder. The queries are generated jointly with the patch tokens during generative modeling, so the pretrained semantic space is preserved. With only eight additional queries and 3.9% extra computation, DecQ raises reconstruction PSNR from 19.13 dB to 22.76 dB over the frozen DINOv2-based RAE and converges 3.3× faster than RAE in latent diffusion training, reaching FID scores of 1.41 without guidance and 1.05 with guidance.
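To put the reported PSNR gain in perspective, the following minimal sketch converts it into a ratio of mean squared errors. It relies only on the standard definition PSNR = 10·log10(MAX² / MSE) and assumes both PSNR values were measured against the same peak value MAX; the function name is hypothetical and not taken from the paper.

```python
import math

def mse_ratio_from_psnr_gain(psnr_before_db, psnr_after_db):
    """Return MSE_after / MSE_before implied by two PSNR values.

    Assumes both values use the same peak signal value, so the peak
    cancels and only the difference in decibels matters.
    """
    gain_db = psnr_after_db - psnr_before_db
    return 10 ** (-gain_db / 10)

# PSNR values reported for the frozen DINOv2-based RAE and for DecQ.
psnr_rae = 19.13
psnr_decq = 22.76

gain_db = psnr_decq - psnr_rae
ratio = mse_ratio_from_psnr_gain(psnr_rae, psnr_decq)

print(f"PSNR gain: {gain_db:.2f} dB")
print(f"MSE ratio (after / before): {ratio:.3f}")
```

Under that assumption, a gain of 3.63 dB corresponds to an MSE of roughly 0.43 times the baseline, that is, a bit more than halving the mean squared reconstruction error.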
📝 Abstract
Representation Autoencoders (RAEs) leverage frozen vision foundation models (VFMs) as tokenizer encoders, providing robust high-level representations that facilitate fast convergence and high-quality generation in latent diffusion models. However, freezing the VFM inherently constrains its spatial reconstruction capacity, limiting fine-grained generation and image editing; in contrast, incorporating reconstruction-oriented signals via fine-tuning disrupts the pretrained semantic space and degrades generative fidelity. To address this trade-off, we propose DecQ, a simple yet effective framework for RAEs. Specifically, DecQ introduces lightweight detail-condensing queries that extract fine-grained information from intermediate VFM features through condenser modules. These queries are incorporated into the decoder to support reconstruction and are jointly generated with patch tokens during generative modeling. By aggregating information from both shallow and deep layers, DecQ effectively mitigates the reconstruction–generation trade-off, improving both reconstruction quality and generative performance. Our experiments demonstrate that: (1) with only 8 additional queries and 3.9% extra computation, DecQ improves reconstruction over the frozen DINOv2-based RAE, increasing PSNR from 19.13 dB to 22.76 dB; and (2) for generative modeling, DecQ achieves 3.3× faster convergence than RAE, attaining an FID of 1.41 without guidance and 1.05 with guidance.
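The abstract does not spell out how a condenser module works internally. As a rough illustration only, the sketch below assumes it behaves like a single-head cross-attention step in which a small set of queries attends over patch features and the resulting tokens are appended to the patch tokens for the decoder. The function names, the toy dimensions, the random values standing in for learned queries and VFM features, and the absence of learned projections are all assumptions, not the authors' implementation.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def condense(queries, patch_features):
    """Hypothetical condenser: each query attends over all patch features.

    Keys and values are the raw features (no learned projections), so each
    output token is a convex combination of the patch feature vectors.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    condensed = []
    for query in queries:
        scores = [
            scale * sum(q * k for q, k in zip(query, feature))
            for feature in patch_features
        ]
        weights = softmax(scores)
        condensed.append([
            sum(w * feature[j] for w, feature in zip(weights, patch_features))
            for j in range(dim)
        ])
    return condensed

random.seed(0)
# 8 queries matches the paper; patch count and width are toy values.
NUM_QUERIES, NUM_PATCHES, DIM = 8, 16, 4

# Random stand-ins for learnable queries and intermediate VFM features.
queries = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]
patches = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_PATCHES)]

detail_tokens = condense(queries, patches)

# Assumed fusion: the decoder sees patch tokens followed by the detail tokens.
decoder_input = patches + detail_tokens
print(len(detail_tokens), len(decoder_input))
```

In this toy setup the decoder input grows from 16 to 24 tokens. The relative overhead would be far smaller with a realistic patch count, although the 3.9% extra computation quoted above is the paper's own measurement and is not derived from this sketch.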
Problem

Research questions and friction points this paper is trying to address.

Representation Autoencoders
Vision Foundation Models
Reconstruction-Generation Trade-off
Fine-grained Generation
Latent Diffusion Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Detail-Condensing Queries
Representation Autoencoders
Frozen Vision Foundation Models
Reconstruction–Generation Trade-off
Latent Diffusion Models