One Layer Is Enough: Adapting Pretrained Visual Encoders for Image Generation

📅 2025-12-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the mismatch between high-dimensional, understanding-oriented features from pretrained vision encoders and the low-dimensional latent spaces that generative models favor, this paper proposes FAE (Feature Auto-Encoder). FAE uses as little as a single attention layer to adapt features from self-supervised encoders such as DINO and SigLIP into a compact generative latent space, coupled with two separate deep decoders: one trained to reconstruct the original feature space, and a second that takes the reconstructed features as input for image generation. The framework is generic: it can be instantiated with a variety of pretrained encoders and plugged into two distinct generative families, diffusion models and normalizing flows. On ImageNet 256×256, the diffusion instantiation reaches a state-of-the-art FID of 1.48 without classifier-free guidance (CFG) and a near state-of-the-art 1.29 with CFG after 800 epochs, while also converging quickly (2.08 and 1.70 FID, respectively, at 80 epochs).

📝 Abstract
Visual generative models (e.g., diffusion models) typically operate in compressed latent spaces to balance training efficiency and sample quality. In parallel, there has been growing interest in leveraging high-quality pre-trained visual representations, either by aligning them inside VAEs or directly within the generative model. However, adapting such representations remains challenging due to fundamental mismatches between understanding-oriented features and generation-friendly latent spaces. Representation encoders benefit from high-dimensional latents that capture diverse hypotheses for masked regions, whereas generative models favor low-dimensional latents that must faithfully preserve injected noise. This discrepancy has led prior work to rely on complex objectives and architectures. In this work, we propose FAE (Feature Auto-Encoder), a simple yet effective framework that adapts pre-trained visual representations into low-dimensional latents suitable for generation using as little as a single attention layer, while retaining sufficient information for both reconstruction and understanding. The key is to couple two separate deep decoders: one trained to reconstruct the original feature space, and a second that takes the reconstructed features as input for image generation. FAE is generic; it can be instantiated with a variety of self-supervised encoders (e.g., DINO, SigLIP) and plugged into two distinct generative families: diffusion models and normalizing flows. Across class-conditional and text-to-image benchmarks, FAE achieves strong performance. For example, on ImageNet 256x256, our diffusion model with CFG attains a near state-of-the-art FID of 1.29 (800 epochs) and 1.70 (80 epochs). Without CFG, FAE reaches the state-of-the-art FID of 1.48 (800 epochs) and 2.08 (80 epochs), demonstrating both high quality and fast learning.
Problem

Research questions and friction points this paper is trying to address.

Pretrained visual encoders produce high-dimensional, understanding-oriented features that mismatch the low-dimensional, noise-preserving latents generative models need
Prior approaches to adapting such representations rely on complex objectives and architectures
A single adaptation scheme should support high-quality generation across different encoders and generative model families
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adapts pretrained encoder features into low-dimensional generative latents with a single attention layer
Couples two separate deep decoders: one reconstructs the original feature space, the other generates images from the reconstructed features
Generic across self-supervised encoders (e.g., DINO, SigLIP) and generative families (diffusion models, normalizing flows)
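The adapter-plus-dual-decoder design described above can be sketched in a few lines of NumPy. This is a minimal illustration only: the shapes, function names, and two-layer MLP stand-ins for the "deep decoders" are assumptions for clarity, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_adapter(features, Wq, Wk, Wv):
    """Single attention layer mapping encoder features (N, D)
    to low-dimensional latents (N, d). Illustrative sketch."""
    Q, K, V = features @ Wq, features @ Wk, features @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

def mlp_decoder(x, W1, W2):
    """Toy two-layer MLP standing in for a deep decoder."""
    return np.maximum(x @ W1, 0.0) @ W2

# Illustrative dimensions: 16 tokens, 64-d encoder features, 8-d latents
N, D, d, H = 16, 64, 8, 32
features = rng.standard_normal((N, D))  # frozen encoder output (e.g. DINO)
Wq = rng.standard_normal((D, d))
Wk = rng.standard_normal((D, d))
Wv = rng.standard_normal((D, d))

# One attention layer adapts features into generation-friendly latents
latents = attention_adapter(features, Wq, Wk, Wv)        # (N, d)

# Decoder 1: trained to reconstruct the original feature space
rec_features = mlp_decoder(latents,
                           rng.standard_normal((d, H)),
                           rng.standard_normal((H, D)))  # (N, D)

# Decoder 2 (image side) would then take rec_features as its input,
# so gradients flow through both reconstruction and generation paths.
print(latents.shape, rec_features.shape)
```

The point of the coupling is that the second (image) decoder consumes the *reconstructed* features, so the low-dimensional latent must retain enough information for both feature reconstruction and image generation.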