🤖 AI Summary
This work addresses the mismatch between tokenization and generation in existing vector-quantized (VQ) image generation methods, which forces generative models to learn from unordered token distributions and compromises output quality and coherence. To resolve this, the authors propose NativeTok, a native visual tokenization approach that explicitly enforces causal dependencies during tokenization, endowing the token sequence with an intrinsic structural order. The key components are a Meta Image Transformer (MIT) for latent image modeling, a Mixture of Causal Expert Transformer (MoCET) in which each lightweight expert block generates a single token conditioned on prior tokens and latent features, and a Hierarchical Native Training strategy that updates only newly introduced expert blocks to keep training efficient. Extensive experiments demonstrate the effectiveness of NativeTok for image reconstruction and generation.
📝 Abstract
VQ-based image generation typically follows a two-stage pipeline: a tokenizer encodes images into discrete tokens, and a generative model learns their dependencies for reconstruction. However, improved tokenization in the first stage does not necessarily enhance the second-stage generation, as existing methods fail to constrain token dependencies. This mismatch forces the generative model to learn from unordered distributions, leading to bias and weak coherence. To address this, we propose native visual tokenization, which enforces causal dependencies during tokenization. Building on this idea, we introduce NativeTok, a framework that achieves efficient reconstruction while embedding relational constraints within token sequences. NativeTok consists of: (1) a Meta Image Transformer (MIT) for latent image modeling, and (2) a Mixture of Causal Expert Transformer (MoCET), where each lightweight expert block generates a single token conditioned on prior tokens and latent features. We further design a Hierarchical Native Training strategy that updates only new expert blocks, ensuring training efficiency. Extensive experiments demonstrate the effectiveness of NativeTok.
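The abstract describes MoCET's core mechanism: tokens are produced one at a time, with each lightweight expert block conditioned on all previously emitted tokens plus the latent image features. The stdlib-only sketch below illustrates that causal, per-position generation pattern; the expert functions, the toy arithmetic, and the codebook size are all illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch of causal, per-token tokenization in the style of
# MoCET as described in the abstract. Every name and computation here is
# a hypothetical stand-in for the paper's neural expert blocks.

def make_expert(position):
    """A toy per-position 'expert': maps (latent, prior tokens) -> one token id."""
    def expert(latent, prior_tokens):
        # Deterministic mixing of the latent feature with prior tokens,
        # standing in for a lightweight transformer block.
        h = position + sum(prior_tokens) + int(latent * 10)
        return h % 256  # token id in a hypothetical 256-entry codebook
    return expert

def tokenize_causally(latent, num_tokens=4):
    """Generate tokens sequentially; token t is conditioned only on tokens < t."""
    experts = [make_expert(t) for t in range(num_tokens)]
    tokens = []
    for expert in experts:
        tokens.append(expert(latent, tokens))  # causal conditioning on the prefix
    return tokens

print(tokenize_causally(0.5))  # → [5, 11, 23, 47]
```

The point of the sketch is the dependency structure, not the arithmetic: because each token is a function of the tokens before it, the resulting sequence carries an intrinsic causal order that a second-stage autoregressive generator can exploit directly.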