🤖 AI Summary
Traditional 2D latent representations suffer from spatial redundancy and high training costs. RepTok addresses this by directly adopting a continuous semantic token, produced by a self-supervised vision transformer, as a compact one-dimensional latent representation, eliminating the conventional 2D grid structure. Methodologically, RepTok fine-tunes this semantic token while jointly training a flow-matching decoder, adding a cosine-similarity loss to preserve the geometric consistency and smoothness of the SSL-pretrained embedding space. This design substantially reduces training overhead while maintaining faithful reconstruction. Experiments show that RepTok achieves competitive results on class-conditional ImageNet generation and transfers zero-shot to text-to-image synthesis on MS-COCO under a minimal compute budget, demonstrating strong generalization and efficiency.
📝 Abstract
We introduce Representation Tokenizer (RepTok), a generative modeling framework that represents an image using a single continuous latent token obtained from self-supervised vision transformers. Building on a pre-trained SSL encoder, we fine-tune only the semantic token embedding and pair it with a generative decoder trained jointly using a standard flow matching objective. This adaptation enriches the token with low-level, reconstruction-relevant details, enabling faithful image reconstruction. To preserve the favorable geometry of the original SSL space, we add a cosine-similarity loss that regularizes the adapted token, ensuring the latent space remains smooth and suitable for generation. Our single-token formulation resolves spatial redundancies of 2D latent spaces and significantly reduces training costs. Despite its simplicity and efficiency, RepTok achieves competitive results on class-conditional ImageNet generation and naturally extends to text-to-image synthesis, reaching competitive zero-shot performance on MS-COCO under extremely limited training budgets. Our findings highlight the potential of fine-tuned SSL representations as compact and effective latent spaces for efficient generative modeling.
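The training objective described above combines two terms: a standard flow-matching loss on the decoder, conditioned on the single latent token, plus a cosine-similarity regularizer that keeps the fine-tuned token aligned with the original SSL embedding. A minimal numpy sketch of that combined objective follows; the function names, the linear (rectified-flow-style) interpolation path, and the weighting `lam` are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def flow_matching_loss(predict_velocity, z, x0, x1, rng):
    # Sample a time t per example and interpolate linearly between
    # noise x0 and data x1 (a common flow-matching path choice).
    t = rng.random((x0.shape[0], 1))
    xt = (1 - t) * x0 + t * x1
    v_target = x1 - x0                        # velocity of the linear path
    v_pred = predict_velocity(xt, t, z)       # decoder conditioned on token z
    return np.mean((v_pred - v_target) ** 2)  # regress predicted velocity

def cosine_reg(z_adapted, z_ssl, eps=1e-8):
    # 1 - cosine similarity: penalizes the fine-tuned token drifting
    # away (in direction) from the frozen SSL token.
    num = np.sum(z_adapted * z_ssl, axis=-1)
    den = (np.linalg.norm(z_adapted, axis=-1)
           * np.linalg.norm(z_ssl, axis=-1) + eps)
    return np.mean(1.0 - num / den)

def reptok_objective(predict_velocity, z_adapted, z_ssl, x0, x1,
                     lam=0.1, rng=None):
    # Joint objective: reconstruction via flow matching, plus a
    # regularizer preserving the SSL geometry (lam is a hypothetical weight).
    rng = rng if rng is not None else np.random.default_rng(0)
    return (flow_matching_loss(predict_velocity, z_adapted, x0, x1, rng)
            + lam * cosine_reg(z_adapted, z_ssl))
```

In practice the velocity predictor would be the jointly trained generative decoder and `z_adapted` the output of the fine-tuned semantic-token head; here a stand-in callable suffices to exercise the loss, e.g. `reptok_objective(lambda xt, t, z: np.zeros_like(xt), z, z_ssl, x0, x1)`.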