HieraTok: Multi-Scale Visual Tokenizer Improves Image Reconstruction and Generation

๐Ÿ“… 2025-09-28
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Existing vision transformers (ViTs) employed as visual tokenizers suffer from single-scale modeling, hindering effective cross-scale information transferโ€”from low-resolution semantics to high-resolution structures. To address this, we propose HieraTok, the first multi-scale ViT-based visual tokenizer. It introduces hierarchical token representations via multi-scale downsampling, scale-causal attention, and layered feature fusion, enabling progressive cross-scale information flow. Methodologically, HieraTok is the first to integrate multi-scale ViT architecture into tokenizer design, significantly improving latent-space distribution and representational capacity. Experiments demonstrate substantial gains: under identical settings, it reduces rFID by 27.2% and improves gFID by 18.9%, while accelerating convergence by 1.38ร—. After large-scale training, HieraTok achieves state-of-the-art performance with rFID = 0.45 and gFID = 1.82.

Technology Category

Application Category

๐Ÿ“ Abstract
In this work, we present HieraTok, a novel multi-scale Vision Transformer (ViT)-based tokenizer that overcomes the inherent limitation of modeling single-scale representations. This is realized through two key designs: (1) multi-scale downsampling applied to the token map generated by the tokenizer encoder, producing a sequence of multi-scale tokens, and (2) a scale-causal attention mechanism that enables the progressive flow of information from low-resolution global semantic features to high-resolution structural details. Coupling these designs, HieraTok achieves significant improvements in both image reconstruction and generation tasks. Under identical settings, the multi-scale visual tokenizer outperforms its single-scale counterpart by a 27.2% improvement in rFID ($1.47 ightarrow 1.07$). When integrated into downstream generation frameworks, it achieves a $1.38 imes$ faster convergence rate and an 18.9% boost in gFID ($16.4 ightarrow 13.3$), which may be attributed to the smoother and more uniformly distributed latent space. Furthermore, by scaling up the tokenizer's training, we demonstrate its potential by a sota rFID of 0.45 and a gFID of 1.82 among ViT tokenizers. To the best of our knowledge, we are the first to introduce multi-scale ViT-based tokenizer in image reconstruction and image generation. We hope our findings and designs advance the ViT-based tokenizers in visual generation tasks.
Problem

Research questions and friction points this paper is trying to address.

Overcoming single-scale representation limitations in visual tokenizers
Enabling progressive information flow from global to detailed features
Improving image reconstruction and generation performance significantly
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-scale tokenizer overcomes single-scale representation limitations
Scale-causal attention enables progressive information flow
Multi-scale downsampling produces hierarchical visual tokens
๐Ÿ”Ž Similar Papers
No similar papers found.
C
Cong Chen
Zhejiang University
Z
Ziyuan Huang
Ant Group
C
Cheng Zou
Ant Group
Muzhi Zhu
Muzhi Zhu
Zhejiang University
Computer VisionMachine Learning
Kaixiang Ji
Kaixiang Ji
Ant Group
Computer VisionMultimodal
Jiajia Liu
Jiajia Liu
Ant Group
cv multimodal
J
Jingdong Chen
Ant Group
H
Hao Chen
Zhejiang University
Chunhua Shen
Chunhua Shen
Zhejiang University
Computer VisionMachine Learning