🤖 AI Summary
Existing vision transformers (ViTs) employed as visual tokenizers suffer from single-scale modeling, hindering effective cross-scale information transfer from low-resolution semantics to high-resolution structures. To address this, we propose HieraTok, the first multi-scale ViT-based visual tokenizer. It introduces hierarchical token representations via multi-scale downsampling, scale-causal attention, and layered feature fusion, enabling progressive cross-scale information flow. Methodologically, HieraTok is the first to integrate a multi-scale ViT architecture into tokenizer design, significantly improving latent-space distribution and representational capacity. Experiments demonstrate substantial gains: under identical settings, it reduces rFID by 27.2% and improves gFID by 18.9%, while accelerating convergence by 1.38×. After large-scale training, HieraTok achieves state-of-the-art performance with rFID = 0.45 and gFID = 1.82.
📝 Abstract
In this work, we present HieraTok, a novel multi-scale Vision Transformer (ViT)-based tokenizer that overcomes the inherent limitation of modeling single-scale representations. This is realized through two key designs: (1) multi-scale downsampling applied to the token map generated by the tokenizer encoder, producing a sequence of multi-scale tokens, and (2) a scale-causal attention mechanism that enables the progressive flow of information from low-resolution global semantic features to high-resolution structural details. Coupling these designs, HieraTok achieves significant improvements in both image reconstruction and generation tasks. Under identical settings, the multi-scale visual tokenizer outperforms its single-scale counterpart by a 27.2% improvement in rFID ($1.47 \rightarrow 1.07$). When integrated into downstream generation frameworks, it achieves a $1.38\times$ faster convergence rate and an 18.9% boost in gFID ($16.4 \rightarrow 13.3$), which may be attributed to the smoother and more uniformly distributed latent space. Furthermore, by scaling up the tokenizer's training, we demonstrate its potential with a state-of-the-art rFID of 0.45 and a gFID of 1.82 among ViT tokenizers. To the best of our knowledge, we are the first to introduce a multi-scale ViT-based tokenizer for image reconstruction and image generation. We hope our findings and designs advance ViT-based tokenizers in visual generation tasks.
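The two key designs above can be illustrated with a minimal sketch. The function names, the use of average pooling, and the scale factors `(4, 2, 1)` are illustrative assumptions on our part, not the paper's exact implementation; the sketch only shows the shape of the idea, i.e. pooling the encoder's token map into a coarse-to-fine token sequence and masking attention so information flows only from coarser scales to finer ones.

```python
import numpy as np

def multiscale_tokens(token_map, scales=(4, 2, 1)):
    """Average-pool an (H, W, D) token map at each downsampling factor
    in `scales` (coarsest first) and flatten each level into a token
    sequence. Returns concatenated tokens and the count per scale.
    NOTE: average pooling and these scale factors are assumptions."""
    H, W, D = token_map.shape
    levels, counts = [], []
    for s in scales:
        h, w = H // s, W // s
        # reshape-based average pooling with factor s
        pooled = token_map[:h * s, :w * s].reshape(h, s, w, s, D).mean(axis=(1, 3))
        levels.append(pooled.reshape(h * w, D))
        counts.append(h * w)
    return np.concatenate(levels, axis=0), counts

def scale_causal_mask(counts):
    """Boolean (N, N) attention mask: a token may attend to tokens at
    its own scale or any coarser scale, so information flows from
    low-resolution semantics toward high-resolution detail."""
    N = sum(counts)
    mask = np.zeros((N, N), dtype=bool)
    start = 0
    for c in counts:
        end = start + c
        mask[start:end, :end] = True  # own scale + all coarser scales
        start = end
    return mask

# Usage: a 16x16 token map with 8-dim tokens yields 16 + 64 + 256 tokens.
tokens, counts = multiscale_tokens(np.random.randn(16, 16, 8))
mask = scale_causal_mask(counts)
```

In this sketch the coarsest 16 tokens attend only among themselves, while the finest 256 tokens can read every scale, which is one way to realize the "progressive flow from global semantics to structural detail" the abstract describes.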