Adapting Self-Supervised Representations as a Latent Space for Efficient Generation

📅 2025-10-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional 2D latent representations suffer from spatial redundancy and high training costs. To address this, RepTok is proposed: a framework that directly adopts a single continuous semantic token, produced by a self-supervised vision transformer, as a compact one-dimensional latent representation, eliminating the conventional 2D grid structure. Methodologically, RepTok fine-tunes this semantic token embedding while jointly training a flow-matching decoder, incorporating a cosine-similarity loss to preserve the geometric consistency and smoothness inherited from the SSL-pretrained embedding space. This design drastically reduces parameter count and training overhead while maintaining high-fidelity reconstruction. Experiments show that RepTok achieves competitive results on ImageNet class-conditional generation and extends zero-shot to MS-COCO text-to-image synthesis under a minimal computational budget, validating its generalization capability and efficiency.

📝 Abstract
We introduce Representation Tokenizer (RepTok), a generative modeling framework that represents an image using a single continuous latent token obtained from self-supervised vision transformers. Building on a pre-trained SSL encoder, we fine-tune only the semantic token embedding and pair it with a generative decoder trained jointly using a standard flow matching objective. This adaptation enriches the token with low-level, reconstruction-relevant details, enabling faithful image reconstruction. To preserve the favorable geometry of the original SSL space, we add a cosine-similarity loss that regularizes the adapted token, ensuring the latent space remains smooth and suitable for generation. Our single-token formulation resolves spatial redundancies of 2D latent spaces and significantly reduces training costs. Despite its simplicity and efficiency, RepTok achieves competitive results on class-conditional ImageNet generation and naturally extends to text-to-image synthesis, reaching competitive zero-shot performance on MS-COCO under extremely limited training budgets. Our findings highlight the potential of fine-tuned SSL representations as compact and effective latent spaces for efficient generative modeling.
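The abstract describes a joint objective: a standard flow-matching regression loss for the decoder plus a cosine-similarity term that keeps the adapted token close to the original SSL embedding. A minimal single-sample sketch of that objective is below; the shapes, the `toy_decoder` (a plain linear map standing in for the actual generative decoder), and the weighting `lam` are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine_similarity(a, b):
    # Cosine similarity between two vectors, with a small epsilon for stability.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def toy_decoder(x_t, t, z, W):
    # Stand-in for the flow-matching decoder: a linear map over the
    # concatenated noisy sample, time, and conditioning token.
    inp = np.concatenate([x_t, [t], z])
    return W @ inp

def reptok_loss(x1, z_adapted, z_ssl, W, lam=0.1):
    """Single-sample sketch of the joint objective:
    flow-matching velocity regression + cosine-similarity regularization."""
    x0 = rng.standard_normal(x1.shape)   # noise endpoint of the path
    t = rng.uniform()                    # random time in [0, 1]
    x_t = (1 - t) * x0 + t * x1          # linear interpolation between noise and data
    v_target = x1 - x0                   # conditional velocity target
    v_pred = toy_decoder(x_t, t, z_adapted, W)
    fm_loss = np.mean((v_pred - v_target) ** 2)
    # Regularizer: penalize the adapted token drifting away from the SSL embedding.
    reg = 1.0 - cosine_similarity(z_adapted, z_ssl)
    return fm_loss + lam * reg
```

The cosine term is what the abstract credits with keeping the latent space smooth: the token is free to absorb reconstruction-relevant detail, but only within a neighborhood of its pre-trained SSL direction.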
Problem

Research questions and friction points this paper is trying to address.

Adapting self-supervised representations for efficient image generation
Resolving spatial redundancies in 2D latent spaces
Enabling competitive generation under limited training budgets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses self-supervised vision transformer tokens
Fine-tunes the semantic token embedding jointly with a flow-matching decoder
Regularizes latent space with cosine similarity loss