Skrr: Skip and Re-use Text Encoder Layers for Memory Efficient Text-to-Image Generation

📅 2025-02-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
In text-to-image (T2I) diffusion models, the text encoder can consume up to 8× more memory than the denoising module, making it a critical deployment bottleneck despite contributing little to inference time and FLOPs. To address this, the paper proposes Skrr (Skip and Re-use layers), a blockwise pruning strategy tailored to text encoders in T2I generation: it exploits the redundancy of transformer blocks by selectively skipping some layers and reusing others during the single forward pass the encoder requires, reducing memory consumption without compromising performance. Extensive experiments show that Skrr preserves image quality even at high sparsity levels, outperforming existing blockwise pruning methods and achieving state-of-the-art memory efficiency across multiple metrics, including FID, CLIP, DreamSim, and GenEval scores.
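The skip-and-reuse idea in the summary can be sketched as a scheduled forward pass over the encoder's blocks. The sketch below is illustrative only: the `schedule` format and the toy blocks are assumptions, and the paper derives its actual layer selection from a structural analysis of block redundancy, which is not reproduced here.

```python
# Hypothetical sketch of a "skip and re-use" forward pass over a text
# encoder's transformer blocks, in the spirit of Skrr. Each schedule
# entry says what to do at one original block position:
#   ("run", i)   -> apply block i (its weights are kept in memory)
#   ("reuse", i) -> re-apply an already-kept block i
#   ("skip",)    -> identity; the block's weights are never loaded
# Blocks that appear in no ("run", i) or ("reuse", i) entry can be
# dropped from memory entirely, which is where the savings come from.

def skrr_forward(x, blocks, schedule):
    for action in schedule:
        if action[0] in ("run", "reuse"):
            x = blocks[action[1]](x)
        # "skip": identity, nothing to compute or load
    return x


# Toy blocks standing in for transformer layers.
blocks = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]

# Skip block 1 and re-use block 0 in place of block 2;
# block 1's and block 2's weights would never be loaded.
schedule = [("run", 0), ("skip",), ("reuse", 0)]
out = skrr_forward(0, blocks, schedule)  # (0 + 1), skipped, (+ 1) -> 2
```

In a real deployment the schedule would be fixed offline, so the unused blocks are simply never materialized on the device.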

📝 Abstract
Large-scale text encoders in text-to-image (T2I) diffusion models have demonstrated exceptional performance in generating high-quality images from textual prompts. Unlike denoising modules that rely on multiple iterative steps, text encoders require only a single forward pass to produce text embeddings. However, despite their minimal contribution to total inference time and floating-point operations (FLOPs), text encoders demand significantly higher memory usage, up to eight times more than denoising modules. To address this inefficiency, we propose Skip and Re-use layers (Skrr), a simple yet effective pruning strategy specifically designed for text encoders in T2I diffusion models. Skrr exploits the inherent redundancy in transformer blocks by selectively skipping or reusing certain layers in a manner tailored for T2I tasks, thereby reducing memory consumption without compromising performance. Extensive experiments demonstrate that Skrr maintains image quality comparable to the original model even under high sparsity levels, outperforming existing blockwise pruning methods. Furthermore, Skrr achieves state-of-the-art memory efficiency while preserving performance across multiple evaluation metrics, including the FID, CLIP, DreamSim, and GenEval scores.
Problem

Research questions and friction points this paper is trying to address.

High memory usage of text encoders in T2I diffusion models
Preserving image quality under aggressive pruning
Exploiting the redundancy of transformer blocks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Skip and Re-use layers
Memory efficient text-to-image
Transformer blocks redundancy exploitation
Hoigi Seo
Dept. of Electrical and Computer Engineering, Seoul National University, Republic of Korea
Wongi Jeong
Dept. of Electrical and Computer Engineering, Seoul National University, Republic of Korea
Jae-sun Seo
Cornell Tech
VLSI / ASIC, Digital/Mixed-Signal Circuits, FPGA, ML Hardware Design, Neuromorphic Computing
Se Young Chun
Department of Electrical and Computer Engineering, Seoul National University
computational imaging, machine learning, signal processing, multimodal processing