🤖 AI Summary
Existing vision transformers (ViTs) employed as visual tokenizers suffer from single-scale modeling, hindering effective cross-scale information transfer from low-resolution semantics to high-resolution structures. To address this, we propose HieraTok, the first multi-scale ViT-based visual tokenizer. It introduces hierarchical token representations via multi-scale downsampling, scale-causal attention, and layered feature fusion, enabling progressive cross-scale information flow. Methodologically, HieraTok is the first to integrate a multi-scale ViT architecture into tokenizer design, significantly improving latent-space distribution and representational capacity. Experiments demonstrate substantial gains: under identical settings, it reduces rFID by 27.2% and improves gFID by 18.9%, while accelerating convergence by 1.38×. After large-scale training, HieraTok achieves state-of-the-art performance with rFID = 0.45 and gFID = 1.82.
📝 Abstract
In this work, we present HieraTok, a novel multi-scale Vision Transformer (ViT)-based tokenizer that overcomes the inherent limitation of modeling single-scale representations. This is realized through two key designs: (1) multi-scale downsampling applied to the token map generated by the tokenizer encoder, producing a sequence of multi-scale tokens, and (2) a scale-causal attention mechanism that enables the progressive flow of information from low-resolution global semantic features to high-resolution structural details. Coupling these designs, HieraTok achieves significant improvements in both image reconstruction and generation tasks. Under identical settings, the multi-scale visual tokenizer outperforms its single-scale counterpart by a 27.2% improvement in rFID ($1.47 \rightarrow 1.07$). When integrated into downstream generation frameworks, it achieves a $1.38\times$ faster convergence rate and an 18.9% boost in gFID ($16.4 \rightarrow 13.3$), which may be attributed to the smoother and more uniformly distributed latent space. Furthermore, by scaling up the tokenizer's training, we demonstrate its potential with a state-of-the-art rFID of 0.45 and a gFID of 1.82 among ViT tokenizers. To the best of our knowledge, we are the first to introduce a multi-scale ViT-based tokenizer for image reconstruction and image generation. We hope our findings and designs advance ViT-based tokenizers in visual generation tasks.
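The two key designs above can be illustrated with a minimal sketch. The function names, the use of average pooling, and the scale factors `(4, 2, 1)` are illustrative assumptions on our part, not the paper's exact implementation; the sketch only shows the shape of the idea, i.e. pooling the encoder's token map into a coarse-to-fine token sequence and masking attention so information flows only from coarser scales to finer ones.

```python
import numpy as np

def multiscale_tokens(token_map, scales=(4, 2, 1)):
    """Average-pool an (H, W, D) token map at each downsampling factor
    in `scales` (coarsest first) and flatten each level into a token
    sequence. Returns concatenated tokens and the count per scale.
    NOTE: average pooling and these scale factors are assumptions."""
    H, W, D = token_map.shape
    levels, counts = [], []
    for s in scales:
        h, w = H // s, W // s
        # reshape-based average pooling with factor s
        pooled = token_map[:h * s, :w * s].reshape(h, s, w, s, D).mean(axis=(1, 3))
        levels.append(pooled.reshape(h * w, D))
        counts.append(h * w)
    return np.concatenate(levels, axis=0), counts

def scale_causal_mask(counts):
    """Boolean (N, N) attention mask: a token may attend to tokens at
    its own scale or any coarser scale, so information flows from
    low-resolution semantics toward high-resolution detail."""
    N = sum(counts)
    mask = np.zeros((N, N), dtype=bool)
    start = 0
    for c in counts:
        end = start + c
        mask[start:end, :end] = True  # own scale + all coarser scales
        start = end
    return mask

# Usage: a 16x16 token map with 8-dim tokens yields 16 + 64 + 256 tokens.
tokens, counts = multiscale_tokens(np.random.randn(16, 16, 8))
mask = scale_causal_mask(counts)
```

In this sketch the coarsest 16 tokens attend only among themselves, while the finest 256 tokens can read every scale, which is one way to realize the "progressive flow from global semantics to structural detail" the abstract describes.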