🤖 AI Summary
To address the mismatch between high-dimensional semantic features from pretrained vision encoders and the low-dimensional latent spaces that image generation favors, this paper proposes FAE (Feature Auto-Encoder). FAE uses as little as a single attention layer, coupled with two separate deep decoders, to adapt features from self-supervised encoders such as DINO and SigLIP into a generation-friendly latent space while retaining enough information for both reconstruction and understanding. The framework is generic: it can be instantiated with a variety of pretrained encoders and plugged into two distinct generative families, diffusion models and normalizing flows. On ImageNet 256×256, FAE's diffusion model achieves a state-of-the-art FID of 1.48 without classifier-free guidance (CFG) after 800 epochs, and a near state-of-the-art 1.29 with CFG, while also converging quickly (2.08 and 1.70 FID, respectively, after only 80 epochs).
📝 Abstract
Visual generative models (e.g., diffusion models) typically operate in compressed latent spaces to balance training efficiency and sample quality. In parallel, there has been growing interest in leveraging high-quality pre-trained visual representations, either by aligning them inside VAEs or directly within the generative model. However, adapting such representations remains challenging due to fundamental mismatches between understanding-oriented features and generation-friendly latent spaces. Representation encoders benefit from high-dimensional latents that capture diverse hypotheses for masked regions, whereas generative models favor low-dimensional latents that must faithfully preserve injected noise. This discrepancy has led prior work to rely on complex objectives and architectures. In this work, we propose FAE (Feature Auto-Encoder), a simple yet effective framework that adapts pre-trained visual representations into low-dimensional latents suitable for generation using as little as a single attention layer, while retaining sufficient information for both reconstruction and understanding. The key is to couple two separate deep decoders: one trained to reconstruct the original feature space, and a second that takes the reconstructed features as input for image generation. FAE is generic; it can be instantiated with a variety of self-supervised encoders (e.g., DINO, SigLIP) and plugged into two distinct generative families: diffusion models and normalizing flows. Across class-conditional and text-to-image benchmarks, FAE achieves strong performance. For example, on ImageNet 256×256, our diffusion model with CFG attains a near state-of-the-art FID of 1.29 (800 epochs) and 1.70 (80 epochs). Without CFG, FAE reaches the state-of-the-art FID of 1.48 (800 epochs) and 2.08 (80 epochs), demonstrating both high quality and fast learning.
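The architecture described in the abstract can be sketched roughly as follows. This is a minimal PyTorch illustration of the stated idea (a single attention layer mapping pretrained features to a low-dimensional latent, plus two separate deep decoders); all module names, dimensions, and the MLP decoder shapes are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

def mlp(d_in, d_out, hidden=512, depth=4):
    """A small deep MLP standing in for the paper's deep decoders (assumed shape)."""
    layers, d = [], d_in
    for _ in range(depth - 1):
        layers += [nn.Linear(d, hidden), nn.GELU()]
        d = hidden
    layers.append(nn.Linear(d, d_out))
    return nn.Sequential(*layers)

class FAE(nn.Module):
    def __init__(self, feat_dim=768, latent_dim=16, patch_pixels=768):
        super().__init__()
        # A single attention layer adapts the frozen encoder's features
        # (e.g. DINO/SigLIP tokens) before projection to a low-dim latent.
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)
        self.to_latent = nn.Linear(feat_dim, latent_dim)
        # Decoder 1: low-dim latent -> reconstructed pretrained-feature space.
        self.feat_decoder = mlp(latent_dim, feat_dim)
        # Decoder 2: takes the reconstructed features as input and decodes
        # image content (here, flattened patch pixels for illustration).
        self.pixel_decoder = mlp(feat_dim, patch_pixels)

    def forward(self, feats):                    # feats: (B, N, feat_dim)
        h, _ = self.attn(feats, feats, feats)
        z = self.to_latent(h)                    # latent a generative model would operate on
        feats_rec = self.feat_decoder(z)         # reconstruct the original feature space
        patches = self.pixel_decoder(feats_rec)  # decode pixels from reconstructed features
        return z, feats_rec, patches
```

In this reading, a diffusion model or normalizing flow would be trained on `z`, while the two decoders supply the reconstruction path back to features and pixels.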