TC-AE: Unlocking Token Capacity for Deep Compression Autoencoders

📅 2026-04-08
🤖 AI Summary
This work addresses the issue of latent representation collapse in autoencoders under high compression ratios, which severely degrades both reconstruction fidelity and generative capability. To mitigate this, the authors propose a two-stage token-to-latent compression mechanism grounded in the Vision Transformer architecture, which effectively preserves structural information even with a fixed latent dimensionality. Additionally, a joint self-supervised training strategy is introduced to enhance semantic consistency among image tokens, thereby alleviating latent collapse. The method significantly improves image reconstruction quality and generative performance under extreme compression, with its core innovation lying in the synergistic design of token count scaling and semantic-aware compression.
📝 Abstract
We propose TC-AE, a ViT-based architecture for deep compression autoencoders. Existing methods commonly increase the channel dimension of latent representations to maintain reconstruction quality under high compression ratios. However, this strategy often leads to latent representation collapse, which degrades generative performance. Instead of relying on increasingly complex architectures or multi-stage training schemes, TC-AE addresses this challenge from the perspective of the token space, the key bridge between pixels and image latents, through two complementary innovations. First, we study token number scaling by adjusting the patch size in ViT under a fixed latent budget, and identify aggressive token-to-latent compression as the key factor that limits effective scaling. To address this issue, we decompose token-to-latent compression into two stages, reducing structural information loss and enabling effective token number scaling for generation. Second, to further mitigate latent representation collapse, we enhance the semantic structure of image tokens via joint self-supervised training, leading to more generative-friendly latents. With these designs, TC-AE achieves substantially improved reconstruction and generative performance under deep compression. We hope our research will advance ViT-based tokenizers for visual generation.
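The token-scaling trade-off the abstract describes can be made concrete with simple shape arithmetic: under a fixed latent budget, halving the patch size quadruples the token count, so each token must be compressed into a much smaller latent slice. The numbers below (256×256 input, token width 768, a total latent budget of 4096 values) are illustrative assumptions for this sketch, not the paper's actual configuration:

```python
def token_stats(image_size, patch_size, token_dim, latent_budget):
    """Per-configuration token and compression statistics (illustrative).

    Assumes a square image split into non-overlapping square patches,
    and a latent budget shared evenly across all tokens.
    """
    num_tokens = (image_size // patch_size) ** 2
    latent_dim_per_token = latent_budget // num_tokens
    # Token-to-latent compression ratio: how hard each token is squeezed.
    compression = token_dim / latent_dim_per_token
    return num_tokens, latent_dim_per_token, compression

for p in (32, 16, 8):
    n, d, c = token_stats(256, p, 768, 4096)
    print(f"patch {p:2d}: {n:4d} tokens, {d:2d} dims/token, {c:5.1f}x compression")
# patch 32:   64 tokens, 64 dims/token,  12.0x compression
# patch 16:  256 tokens, 16 dims/token,  48.0x compression
# patch  8: 1024 tokens,  4 dims/token, 192.0x compression
```

This is the regime the paper calls aggressive token-to-latent compression: scaling tokens up by 16× drives the per-token compression ratio from 12× to 192×, which is why a single-step projection loses structural information and motivates decomposing the compression into two stages.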
Problem

Research questions and friction points this paper is trying to address.

latent representation collapse
deep compression autoencoders
token-to-latent compression
generative performance
ViT-based architecture
Innovation

Methods, ideas, or system contributions that make the work stand out.

token compression
ViT-based autoencoder
latent representation collapse
self-supervised training
deep image compression