DiT-IC: Aligned Diffusion Transformer for Efficient Image Compression

📅 2026-03-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges of high sampling overhead, substantial memory consumption, and degraded reconstruction quality in highly compressed latent spaces that hinder the application of diffusion models to image compression. The authors propose DiT-IC, the first adaptation of a multi-step text-to-image Diffusion Transformer (DiT) into a single-step image compression framework, enabling efficient reconstruction in a 32× downsampled latent space. By integrating variance-guided reconstruction flow, self-distillation alignment, and latent conditional guidance, the method significantly enhances both reconstruction fidelity and computational efficiency. DiT-IC achieves state-of-the-art perceptual quality while accelerating decoding speed by up to 30× and substantially reducing memory usage, enabling real-time reconstruction of 2048×2048 images on a 16GB GPU.
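The efficiency claim rests on how small the latent grid becomes at 32× downsampling compared with the 8× latents typical of U-Net diffusion codecs. A quick back-of-the-envelope calculation (ours, not from the paper) makes the gap concrete; the `latent_tokens` helper is hypothetical:

```python
def latent_tokens(height: int, width: int, downscale: int) -> int:
    """Number of spatial positions in a latent grid after `downscale`x
    spatial downsampling of the input image."""
    return (height // downscale) * (width // downscale)

h = w = 2048  # image size from the paper's real-time demo
tokens_8x = latent_tokens(h, w, 8)    # typical U-Net diffusion-codec latent
tokens_32x = latent_tokens(h, w, 32)  # DiT-IC's latent resolution
print(tokens_8x, tokens_32x, tokens_8x // tokens_32x)  # 65536 4096 16
```

A 16× reduction in latent positions shrinks both per-step compute and activation memory, which is consistent with the reported 30× decoding speedup once single-step sampling is factored in.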

📝 Abstract
Diffusion-based image compression has recently shown outstanding perceptual fidelity, yet its practicality is hindered by prohibitive sampling overhead and high memory usage. Most existing diffusion codecs employ U-Net architectures, where hierarchical downsampling forces diffusion to operate in shallow latent spaces (typically with only 8× spatial downscaling), resulting in excessive computation. In contrast, conventional VAE-based codecs work in much deeper latent domains (16×–64× downscaled), motivating a key question: Can diffusion operate effectively in such compact latent spaces without compromising reconstruction quality? To address this, we introduce DiT-IC, an Aligned Diffusion Transformer for Image Compression, which replaces the U-Net with a Diffusion Transformer capable of performing diffusion in latent space entirely at 32× downscaled resolution. DiT-IC adapts a pretrained text-to-image multi-step DiT into a single-step reconstruction model through three key alignment mechanisms: (1) a variance-guided reconstruction flow that adapts denoising strength to latent uncertainty for efficient reconstruction; (2) a self-distillation alignment that enforces consistency with encoder-defined latent geometry to enable one-step diffusion; and (3) a latent-conditioned guidance that replaces text prompts with semantically aligned latent conditions, enabling text-free inference. With these designs, DiT-IC achieves state-of-the-art perceptual quality while offering up to 30× faster decoding and drastically lower memory usage than existing diffusion-based codecs. Remarkably, it can reconstruct 2048×2048 images on a 16 GB laptop GPU.
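The three alignment mechanisms in the abstract can be summarized as one conditional denoising pass over the decoded latent. The structural sketch below is ours, not the authors' code: `dit_denoise` is a stand-in for the real transformer, and every name, shape, and the noise-injection rule are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def dit_denoise(z_noisy, condition, t):
    """Stand-in for one DiT forward pass: maps a noisy latent, a
    latent-derived condition, and a timestep to a clean latent.
    A real model is a transformer; this fixed blend is a placeholder."""
    return 0.9 * z_noisy + 0.1 * condition

def one_step_reconstruct(z_hat, sigma):
    """Single-step reconstruction in the spirit of DiT-IC:
    - the decoded latent itself serves as the condition
      (latent-conditioned guidance replaces text prompts);
    - noise of strength `sigma` models variance-guided denoising,
      where `sigma` would be tied to latent uncertainty;
    - exactly one denoising pass produces the output."""
    condition = z_hat
    z_noisy = z_hat + sigma * rng.standard_normal(z_hat.shape)
    return dit_denoise(z_noisy, condition, t=sigma)

# A 2048x2048 image at 32x downsampling gives a 64x64 latent grid;
# the 16-channel depth is an assumption, not a figure from the paper.
z_hat = rng.standard_normal((64, 64, 16))
z_rec = one_step_reconstruct(z_hat, sigma=0.1)
print(z_rec.shape)  # (64, 64, 16)
```

The self-distillation alignment described in the abstract would act at training time, pulling `dit_denoise` outputs toward the encoder-defined latent geometry so that this single pass suffices; it has no counterpart in an inference-only sketch like this one.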
Problem

Research questions and friction points this paper is trying to address.

diffusion-based image compression
latent space efficiency
sampling overhead
memory usage
perceptual fidelity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion Transformer
Image Compression
Latent Space Alignment
Single-step Reconstruction
Perceptual Quality
Junqi Shi
Nanjing University
Video Coding, Quantization, Implicit Neural Representation

Ming Lu
Nanjing University
Image and Video Processing, Data Compression

Xingchen Li
School of Electronic Science and Engineering, Nanjing University, Nanjing 210023, China

Anle Ke
School of Electronic Science and Engineering, Nanjing University, Nanjing 210023, China

Ruiqi Zhang
School of Electronic Science and Engineering, Nanjing University, Nanjing 210023, China

Zhan Ma
Vision Lab, Nanjing University
Learning for Video Coding & Communication, Computational Imaging