VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction

📅 2025-11-28
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of establishing a unified representation framework for multimodal understanding, generation, and reconstruction. To this end, we propose VQRAE, a vector-quantized representation autoencoder that jointly models continuous semantic features and discrete generative tokens within a single tokenizer. Our key contributions are threefold: (1) the first unified tokenizer that simultaneously enables semantic understanding and high-fidelity visual generation; (2) a high-dimensional semantic codebook achieving 100% codebook utilization; and (3) a symmetric Vision Transformer (ViT) decoder coupled with a two-stage training strategy: first freezing the encoder to learn the codebook, then jointly optimizing the full model via self-distillation. Leveraging pretrained vision foundation models, our approach achieves state-of-the-art performance across diverse understanding, generation, and reconstruction benchmarks, while balancing fine-grained reconstruction fidelity, generation efficiency, and model scalability.

๐Ÿ“ Abstract
Unifying representations for multimodal understanding, generation, and reconstruction in a single tokenizer remains a key challenge in building unified models. Previous research predominantly attempts to address this with a dual-encoder paradigm, e.g., utilizing separate encoders for understanding and generation respectively, or balancing semantic representations and low-level features with a contrastive loss. In this paper, we propose VQRAE, a Vector Quantization version of Representation AutoEncoders, which pioneers the first exploration of a unified representation that produces Continuous semantic features for image understanding and Discrete tokens for visual generation within a unified tokenizer. Specifically, we build upon pretrained vision foundation models with a symmetric ViT decoder and adopt a two-stage training strategy: first, we freeze the encoder and learn a high-dimensional semantic VQ codebook with a pixel reconstruction objective; then we jointly optimize the encoder under self-distillation constraints. This design incurs negligible semantic information loss, maintaining multimodal understanding while yielding discrete tokens compatible with generation and fine-grained reconstruction. Besides, we identify an intriguing property of quantizing semantic encoders: they rely on a high-dimensional codebook, in contrast to the previous common practice of low-dimensional codebooks in image reconstruction. The semantic VQ codebook achieves a 100% utilization ratio at a dimension of 1536. VQRAE presents competitive performance on several benchmarks of visual understanding, generation, and reconstruction, with promising scaling properties in the autoregressive paradigm owing to its discrete tokens.
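To make the quantization step concrete, here is a minimal NumPy sketch of nearest-neighbour vector quantization with a 1536-dimensional codebook, including the codebook-utilization measurement the abstract mentions. All sizes except the 1536 dimension are hypothetical, the data is random, and the paper's actual encoder is a pretrained ViT, so this is an illustration of the lookup mechanics only, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: the paper reports a 1536-dimensional semantic codebook;
# the codebook size and token count here are illustrative only.
codebook_size, dim, num_tokens = 512, 1536, 4096

codebook = rng.normal(size=(codebook_size, dim)).astype(np.float32)
features = rng.normal(size=(num_tokens, dim)).astype(np.float32)  # stand-in encoder outputs

def quantize(z, codebook):
    """Nearest-neighbour lookup: map each continuous feature to a discrete index."""
    # Squared Euclidean distance from every feature to every codebook entry.
    d = (z ** 2).sum(1, keepdims=True) - 2 * z @ codebook.T + (codebook ** 2).sum(1)
    idx = d.argmin(axis=1)   # discrete tokens, usable for autoregressive generation
    z_q = codebook[idx]      # quantized features, fed to the decoder for reconstruction
    return idx, z_q

idx, z_q = quantize(features, codebook)

# Codebook utilization: fraction of entries hit at least once.
utilization = np.unique(idx).size / codebook_size
print(f"utilization = {utilization:.2%}")
```

The same indices serve two roles: as discrete tokens for generation and, via the embedding lookup, as quantized features for pixel reconstruction.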
Problem

Research questions and friction points this paper is trying to address.

Unifying multimodal understanding, generation, and reconstruction in a single tokenizer
Producing continuous semantic features for understanding and discrete tokens for generation
Developing a unified representation that maintains multimodal capabilities while enabling reconstruction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vector Quantization Autoencoder for unified multimodal representation
Two-stage training with semantic VQ codebook and self-distillation
High-dimensional discrete tokens enabling understanding and generation
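The two-stage strategy listed above can be sketched as a pair of loss computations: stage 1 fits the codebook and decoder against a pixel reconstruction objective with the encoder frozen, and stage 2 unfreezes the encoder while a self-distillation term keeps its features close to the frozen teacher. The toy weights, shapes, and linear "encoder/decoder" below are placeholders for the paper's ViT modules; only the loss structure is being illustrated:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-ins (all shapes hypothetical; the paper uses ViT encoder/decoder).
dim = 8
W_enc = rng.normal(size=(dim, dim))   # pretrained "encoder" weights, frozen in stage 1
W_dec = rng.normal(size=(dim, dim))   # symmetric "decoder" weights
codebook = rng.normal(size=(16, dim)) # semantic VQ codebook

def mse(a, b):
    return float(((a - b) ** 2).mean())

x = rng.normal(size=(32, dim))        # stand-in "pixels"
z = x @ W_enc                         # continuous semantic features
teacher = z.copy()                    # frozen teacher features for self-distillation

# Stage 1: encoder frozen; quantize, then fit codebook + decoder to reconstruct pixels.
idx = ((z[:, None, :] - codebook[None]) ** 2).sum(-1).argmin(1)
z_q = codebook[idx]
recon_loss = mse(z_q @ W_dec, x)

# Stage 2: unfreeze the encoder (a small perturbation here stands in for an
# optimizer step) and add a self-distillation constraint so the new features
# stay close to the frozen teacher, preserving understanding capability.
z_new = x @ (W_enc + 0.01 * rng.normal(size=W_enc.shape))
distill_loss = mse(z_new, teacher)
total_loss = recon_loss + distill_loss
```

In a real training run both losses would drive gradient updates; the sketch only shows which terms are active in each stage and why the distillation term protects the pretrained semantics.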