Beyond the Last Layer: Multi-Layer Representation Fusion for Visual Tokenization

📅 2026-05-11
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing visual tokenizers built on frozen pretrained encoders use only the encoder's final-layer features, discarding the hierarchical information in intermediate layers, so low-level details survive only as attenuated residuals after semantic abstraction. The authors propose DRoRAE (Depth-Routed Representation AutoEncoder), a lightweight fusion module that adaptively aggregates features from all encoder layers via energy-constrained routing and incremental correction, producing an enriched latent that remains compatible with a frozen pretrained decoder. A three-phase decoupled training strategy first learns the fusion against the frozen decoder, then fine-tunes the decoder to exploit the enriched representation. The study also reports a log-linear scaling law (R²=0.86) between fusion capacity and reconstruction quality, identifying representation richness as a predictably scalable dimension for visual tokenizers. On ImageNet-256, DRoRAE reduces rFID from 0.57 to 0.29 and improves generation FID from 1.74 to 1.65 (with AutoGuidance), with gains also transferring to text-to-image synthesis.
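To make the "log-linear scaling law" concrete, here is a minimal sketch of how such a fit and its R² could be computed. It assumes the law has the form y = a + b·ln(x), which is one common reading of "log-linear"; the listing does not state the exact form or which quantity measures capacity. The function name `fit_log_linear` and the sample values are hypothetical placeholders, not results from the paper.

```python
import numpy as np

def fit_log_linear(x, y):
    """Least-squares fit of y = a + b * ln(x); returns (a, b, r_squared).

    Assumption: 'log-linear' means the metric is linear in the log of the
    capacity variable. The paper may define it differently.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    log_x = np.log(x)
    b, a = np.polyfit(log_x, y, 1)           # slope first, then intercept
    pred = a + b * log_x
    ss_res = np.sum((y - pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)      # total sum of squares
    return a, b, 1.0 - ss_res / ss_tot

# Synthetic placeholder data (NOT from the paper): a noiseless log-linear trend.
capacity = np.array([1.0, 2.0, 4.0, 8.0])
metric = 1.0 - 0.2 * np.log(capacity)
a, b, r2 = fit_log_linear(capacity, metric)
```

An R² of 0.86, as reported, would mean the fitted line explains 86% of the variance in the measured quality metric across the capacities tested.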
📝 Abstract
Representation autoencoders that reuse frozen pretrained vision encoders as visual tokenizers have achieved strong reconstruction and generation quality. However, existing methods universally extract features from only the last encoder layer, discarding the rich hierarchical information distributed across intermediate layers. We show that low-level visual details survive in the last layer merely as attenuated residuals after multiple layers of semantic abstraction, and that explicitly fusing multi-layer features can substantially recover this lost information. We propose DRoRAE (Depth-Routed Representation AutoEncoder), a lightweight fusion module that adaptively aggregates all encoder layers via energy-constrained routing and incremental correction, producing an enriched latent compatible with a frozen pretrained decoder. A three-phase decoupled training strategy first learns the fusion under the implicit distributional constraint of the frozen decoder, then fine-tunes the decoder to fully exploit the enriched representation. On ImageNet-256, DRoRAE reduces rFID from 0.57 to 0.29 and improves generation FID from 1.74 to 1.65 (with AutoGuidance), with gains also transferring to text-to-image synthesis. Furthermore, we uncover a log-linear scaling law (R²=0.86) between fusion capacity and reconstruction quality, identifying representation richness as a new, predictably scalable dimension for visual tokenizers analogous to vocabulary size in NLP.
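The abstract names the ingredients of the fusion module but not its equations. The sketch below is one hypothetical reading, not the authors' implementation: "routing" is taken to be a per-token softmax over layers, "incremental correction" to mean adding a residual to the last-layer feature, and "energy-constrained" to mean capping that residual's norm relative to the last-layer norm. The names `fuse_layers`, `router_w`, and `max_ratio` are all assumptions introduced for illustration.

```python
import numpy as np

def fuse_layers(feats, router_w, max_ratio=0.5):
    """Hypothetical multi-layer fusion sketch.

    feats:     (L, N, D) token features from L encoder layers, last layer at index -1
    router_w:  (D,) illustrative routing vector (would be learned in practice)
    max_ratio: assumed cap on ||correction|| / ||last-layer feature|| per token
    """
    last = feats[-1]                                     # (N, D)

    # Routing: per-token softmax over layers (assumed form).
    logits = feats @ router_w                            # (L, N)
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=0, keepdims=True)

    # Weighted mix of all layers, expressed as a correction to the last layer.
    mixed = (weights[..., None] * feats).sum(axis=0)     # (N, D)
    delta = mixed - last

    # Energy constraint (assumed form): shrink delta if its norm is too large.
    d_norm = np.linalg.norm(delta, axis=-1, keepdims=True)
    l_norm = np.linalg.norm(last, axis=-1, keepdims=True)
    scale = np.minimum(1.0, max_ratio * l_norm / np.maximum(d_norm, 1e-12))

    # Incremental correction: stay close to the latent the frozen decoder expects.
    return last + scale * delta, weights

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 6, 8))    # 4 layers, 6 tokens, 8 channels (toy sizes)
router_w = rng.normal(size=8)
fused, weights = fuse_layers(feats, router_w)
```

Bounding the correction keeps the fused latent near the last-layer distribution, which is one plausible way to stay compatible with a frozen pretrained decoder as the abstract requires.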
Problem

Research questions and friction points this paper is trying to address.

visual tokenization
multi-layer representation
representation autoencoder
hierarchical features
information loss
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-layer fusion
visual tokenization
representation richness
depth-routed autoencoder
scaling law