Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing

📅 2025-12-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing generative paradigms built on representation encoders face two key challenges: (1) discriminative feature spaces lack compact regularization, causing diffusion sampling to deviate from the data manifold and yield structural distortions; and (2) encoders exhibit weak pixel-level reconstruction capability, limiting geometric and textural fidelity. To address these, we propose a semantic-pixel joint reconstruction objective that, within a compact 96-channel latent space at 16×16 spatial downsampling, achieves the first unified high-semantic, high-fidelity pixel reconstruction. Our method integrates dual reconstruction losses, a compact latent-space design, a unified representation-encoder-based diffusion architecture for T2I and editing, and a VAE feature-space adaptation mechanism. Experiments demonstrate significant improvements in reconstruction quality, text-to-image generation, and image editing, reaching state-of-the-art performance with accelerated convergence. This validates the feasibility of efficiently transferring understanding-oriented encoders into robust generative latent spaces.

📝 Abstract
Modern Latent Diffusion Models (LDMs) typically operate in low-level Variational Autoencoder (VAE) latent spaces that are primarily optimized for pixel-level reconstruction. To unify vision generation and understanding, a burgeoning trend is to adopt high-dimensional features from representation encoders as generative latents. However, we empirically identify two fundamental obstacles in this paradigm: (1) the discriminative feature space lacks compact regularization, making diffusion models prone to off-manifold latents that lead to inaccurate object structures; and (2) the encoder's inherently weak pixel-level reconstruction hinders the generator from learning accurate fine-grained geometry and texture. In this paper, we propose a systematic framework to adapt understanding-oriented encoder features for generative tasks. We introduce a semantic-pixel reconstruction objective to regularize the latent space, enabling the compression of both semantic information and fine-grained details into a highly compact representation (96 channels with 16x16 spatial downsampling). This design ensures that the latent space remains semantically rich and achieves state-of-the-art image reconstruction, while remaining compact enough for accurate generation. Leveraging this representation, we design a unified Text-to-Image (T2I) and image editing model. Benchmarking against various feature spaces, we demonstrate that our approach achieves state-of-the-art reconstruction, faster convergence, and substantial performance gains in both T2I and editing tasks, validating that representation encoders can be effectively adapted into robust generative components.
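To make the abstract's "96 channels with 16x16 spatial downsampling" concrete, the arithmetic below works out the size of that latent grid and the resulting compression factor. The 256×256 RGB input resolution is an assumption for illustration; it is not stated in this summary.

```python
# Latent compactness of a 96-channel latent space with 16x16 spatial
# downsampling, for an assumed 256x256 RGB input (resolution is hypothetical).
img_h = img_w = 256
pixel_values = img_h * img_w * 3              # 196,608 input values

lat_h, lat_w = img_h // 16, img_w // 16       # 16x16 latent grid
latent_values = lat_h * lat_w * 96            # 24,576 latent values

compression = pixel_values / latent_values    # 8.0x fewer values than pixels
print(lat_h, lat_w, latent_values, compression)
```

Under this assumption, the latent carries 8× fewer values than the raw pixels while, per the abstract, still supporting state-of-the-art reconstruction.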
Problem

Research questions and friction points this paper is trying to address.

Adapting discriminative encoders for generative tasks with semantic-pixel regularization
Overcoming off-manifold latents and weak reconstruction in representation encoders
Enabling compact, semantically rich latents for text-to-image generation and editing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic-pixel reconstruction objective regularizes latent space
Compresses semantic and fine details into compact representation
Unified T2I and editing model using adapted encoder features
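The dual-loss idea above can be sketched as a single training objective that sums a pixel-level reconstruction term with a semantic alignment term against frozen encoder features. This is a minimal NumPy sketch, not the paper's implementation: the L1 pixel loss, cosine semantic loss, and the weights are assumptions.

```python
import numpy as np

def semantic_pixel_loss(recon_img, target_img, recon_feat, target_feat,
                        pixel_weight=1.0, semantic_weight=0.5):
    """Hypothetical semantic-pixel joint reconstruction objective.

    recon_img / target_img: decoded and ground-truth images (any shape).
    recon_feat / target_feat: per-token features (..., dim), where
    target_feat would come from a frozen representation encoder.
    """
    # Pixel term: mean L1 distance between reconstruction and ground truth.
    pixel = np.abs(recon_img - target_img).mean()

    # Semantic term: 1 - cosine similarity between reconstructed and
    # frozen-encoder features, averaged over tokens.
    num = (recon_feat * target_feat).sum(axis=-1)
    den = (np.linalg.norm(recon_feat, axis=-1)
           * np.linalg.norm(target_feat, axis=-1) + 1e-8)
    semantic = (1.0 - num / den).mean()

    return pixel_weight * pixel + semantic_weight * semantic
```

The loss is zero (up to the epsilon) when the reconstruction matches both the pixels and the encoder features, which is the regularization the bullets describe: the latent must stay on the semantic manifold while still decoding faithfully.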