Composable Visual Tokenizers with Generator-Free Diagnostics of Learnability

📅 2026-02-03

🤖 AI Summary

This work proposes CompTok, a framework for training visual tokenizers that aims at fine-grained semantic control and better learnability in image generation. CompTok pairs a token-conditioned diffusion decoder with an InfoGAN-inspired recognition objective, which forces the decoder to make use of every visual token. During training, subsets of tokens are swapped between different images; because these swapped combinations have no ground-truth target image, an adversarial manifold regularizer keeps the resulting generations on the natural-image distribution. The work also introduces two generator-free metrics that evaluate compositionality and learnability directly in the token space. CompTok achieves state-of-the-art performance on class-conditional image generation, enables high-level semantic editing such as cross-image token swapping, and improves on both proposed metrics.

📝 Abstract

We introduce CompTok, a training framework for learning visual tokenizers whose tokens are more compositional. CompTok uses a token-conditioned diffusion decoder. With an InfoGAN-style objective, in which a recognition model is trained to predict, from the decoded image, the tokens that conditioned the diffusion decoder, we force the decoder to use every token rather than ignore some of them. To promote compositional control of the tokens over the decoder, CompTok trains not only on the original images but also on token sequences formed by swapping token subsets between images. Because these swapped token sequences have no ground-truth target images, we apply a manifold constraint via an adversarial flow regularizer to keep unpaired swap generations on the natural-image distribution. The resulting tokenizer achieves state-of-the-art performance on class-conditional image generation and also supports high-level semantic editing of an image by swapping tokens between images. Additionally, we propose two metrics that measure the landscape of the token space; they describe not only how compositional the tokens are but also how easy that landscape is to learn for a generator trained on this space. In experiments, CompTok improves on both metrics and supports state-of-the-art generators for class-conditional generation.
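
The token-swapping step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the swapped subset is chosen or how tokens are represented, so everything here is an assumption for illustration: the function names `swap_token_subsets` and `sample_swap_indices` are hypothetical, tokens are treated as opaque items in equal-length lists, and the swapped positions are sampled uniformly at random.

```python
import random


def sample_swap_indices(num_tokens, num_swapped, rng):
    """Pick which token positions to exchange (uniform sampling is an assumption)."""
    return sorted(rng.sample(range(num_tokens), num_swapped))


def swap_token_subsets(tokens_a, tokens_b, swap_idx):
    """Exchange the tokens at positions swap_idx between two token sequences.

    Returns two new sequences; the inputs are left unchanged.
    """
    if len(tokens_a) != len(tokens_b):
        raise ValueError("token sequences must have the same length")
    mixed_a = list(tokens_a)
    mixed_b = list(tokens_b)
    for i in swap_idx:
        # Always read from the original sequences so repeated indices are harmless.
        mixed_a[i] = tokens_b[i]
        mixed_b[i] = tokens_a[i]
    return mixed_a, mixed_b


# Toy tokens for two images; real tokens would be encoder outputs.
tokens_a = ["a0", "a1", "a2", "a3"]
tokens_b = ["b0", "b1", "b2", "b3"]

mixed_a, mixed_b = swap_token_subsets(tokens_a, tokens_b, [1, 3])
print(mixed_a)  # ['a0', 'b1', 'a2', 'b3']
print(mixed_b)  # ['b0', 'a1', 'b2', 'a3']

# Randomly chosen positions, as one might do per training pair.
rng = random.Random(0)
random_idx = sample_swap_indices(len(tokens_a), 2, rng)
random_a, random_b = swap_token_subsets(tokens_a, tokens_b, random_idx)
```

In a training loop along these lines, the mixed sequences would condition the decoder without any paired target image, which is why the abstract pairs this step with a regularizer rather than a reconstruction loss.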

Problem

Research questions and friction points this paper is trying to address.

visual tokenization

compositionality

learnability

semantic editing

image generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

composable tokenization

generator-free diagnostics

token-conditioned diffusion