One-Step is Enough: Sparse Autoencoders for Text-to-Image Diffusion Models

📅 2024-10-28
📈 Citations: 14
Influential: 1
🤖 AI Summary
This work addresses the limited interpretability of internal representations in text-to-image diffusion models such as SDXL Turbo. It is the first to use sparse autoencoders (SAEs) to interpret the activation updates performed by transformer blocks in the denoising U-Net during single-step generation. Methodologically, the authors propose a lightweight, single-step SAE training paradigm that generalizes across denoising timesteps and across models (SDXL Turbo → SDXL Base), and construct RIEBench—the first representation-based image editing benchmark—for systematic functional evaluation of model components. Key contributions include: (1) empirical validation that SAEs learn semantically coherent, causally intervenable intermediate features; (2) discovery of layer-specific responsiveness of transformer blocks to distinct editing tasks; and (3) demonstration of zero-shot, fine-tuning-free cross-model representation transfer and precise editing control via latent intervention.

📝 Abstract
For large language models (LLMs), sparse autoencoders (SAEs) have been shown to decompose intermediate representations, which are often not directly interpretable, into sparse sums of interpretable features, facilitating better control and subsequent analysis. However, similar analyses and approaches have been lacking for text-to-image models. We investigate the possibility of using SAEs to learn interpretable features for SDXL Turbo, a few-step text-to-image diffusion model. To this end, we train SAEs on the updates performed by transformer blocks within SDXL Turbo's denoising U-Net in its 1-step setting. Interestingly, we find that they generalize to 4-step SDXL Turbo and even to the multi-step SDXL base model (i.e., a different model) without additional training. In addition, we show that their learned features are interpretable, causally influence the generation process, and reveal specialization among the blocks. We do so by creating RIEBench, a representation-based image editing benchmark, for editing images while they are generated by turning individual SAE features on and off. This allows us to track which transformer blocks' features are the most impactful depending on the edit category. Our work is the first investigation of SAEs for interpretability in text-to-image diffusion models, and our results establish SAEs as a promising approach for understanding and manipulating the internal mechanisms of text-to-image models.
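The core idea—decomposing a transformer block's activation update into a sparse sum of dictionary features—can be illustrated with a minimal sketch. This is a hypothetical TopK sparse autoencoder forward pass in NumPy; the dimensions, the TopK sparsity mechanism, and all variable names are illustrative assumptions, not the paper's exact architecture or training setup.

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k=8):
    """Encode an activation update x into at most k active features, then reconstruct.

    x:     (d,)   activation update produced by a transformer block
    W_enc: (m, d) encoder weights, m >> d (overcomplete feature dictionary)
    W_dec: (d, m) decoder weights (each column is one feature direction)
    """
    pre = W_enc @ x + b_enc              # (m,) feature pre-activations
    acts = np.maximum(pre, 0.0)          # ReLU keeps non-negative activations
    if k < acts.size:                    # TopK sparsity: zero all but k largest
        drop = np.argpartition(acts, -k)[:-k]
        acts[drop] = 0.0
    x_hat = W_dec @ acts + b_dec         # reconstruct the update as a sparse sum
    return acts, x_hat

# Toy usage with random weights (no training shown here).
rng = np.random.default_rng(0)
d, m, k = 64, 512, 8
W_enc = rng.normal(0, 0.1, (m, d))
W_dec = rng.normal(0, 0.1, (d, m))
b_enc, b_dec = np.zeros(m), np.zeros(d)
x = rng.normal(0, 1.0, d)
acts, x_hat = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)
print((acts > 0).sum())  # number of active features, at most k
```

Training would minimize the reconstruction error between `x_hat` and `x` over many block updates; the sparsity constraint is what pushes individual features toward interpretable directions.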
Problem

Research questions and friction points this paper is trying to address.

Applying sparse autoencoders to interpret SDXL Turbo's features
Generalizing SAEs across different text-to-image diffusion models
Developing RIEBench for feature-based image editing analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse autoencoders analyze SDXL Turbo features
SAEs generalize across different diffusion models
RIEBench benchmarks feature-based image editing
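The RIEBench-style intervention—editing an image mid-generation by turning an SAE feature on or off—can be sketched as a simple modification of a block's activation update along one decoder direction. This is a hedged illustration: the function name, scaling scheme, and unit-norm feature direction are assumptions for the sketch, not the paper's exact editing procedure.

```python
import numpy as np

def intervene_on_feature(update, feature_dir, activation, target):
    """Steer a block's activation update along one SAE feature direction.

    update:      (d,) original activation update from the block
    feature_dir: (d,) unit-norm decoder column for the chosen SAE feature
    activation:  the feature's current activation value
    target:      new activation value (0.0 turns the feature off)
    """
    # Remove the feature's current contribution, then add the desired amount.
    return update - activation * feature_dir + target * feature_dir

# Toy usage: turning a feature "off" removes its component from the update.
rng = np.random.default_rng(1)
d = 64
feature_dir = rng.normal(0, 1.0, d)
feature_dir /= np.linalg.norm(feature_dir)       # unit-norm direction
update = rng.normal(0, 1.0, d)
a = float(update @ feature_dir)                  # current activation along the direction
edited = intervene_on_feature(update, feature_dir, a, 0.0)
print(abs(edited @ feature_dir))                 # ~0: feature is switched off
```

Running this intervention at different transformer blocks, and measuring which edit categories respond, is the kind of per-block analysis the benchmark enables.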