🤖 AI Summary
This paper addresses the low fidelity of visual tokenizers and VAEs in reconstructing fine-grained content—such as text and faces—by introducing TokBench, a lightweight and efficient benchmark. Methodologically, it proposes, for the first time, a dual-dimensional quantitative evaluation framework based on OCR accuracy and face feature similarity, replacing insensitive traditional metrics (e.g., PSNR/SSIM); it further designs a semantic-aware, low-overhead assessment protocol (2 GB memory, 4 minutes per model). Contributions include: (1) systematically revealing that mainstream models severely degrade text structure and facial identity at small token scales; (2) empirically demonstrating a significant misalignment between conventional metrics and human perceptual judgment; and (3) pioneering the extension of tokenizer evaluation to video tokenization. TokBench provides a reproducible, interpretable, and standardized evaluation tool for fine-grained visual representation learning.
📝 Abstract
In this work, we reveal the limitations of visual tokenizers and VAEs in preserving fine-grained features, and propose a benchmark to evaluate reconstruction performance for two challenging types of visual content: text and faces. Image tokenization has significantly advanced visual generation and multimodal modeling, particularly for autoregressive models, owing to the modeling simplicity of discrete tokens. Autoregressive models typically rely on image tokenizers to compress images into discrete tokens for sequential prediction, whereas diffusion models often operate on a continuous latent space to reduce computational costs. However, both visual compression approaches inevitably lose visual information, thereby limiting the upper bound of visual generation quality. To evaluate how these compression losses affect text and faces, the visual elements humans are most sensitive to, we first curate a collection of text and face images from existing datasets, ensuring clarity and diversity. For text reconstruction, we employ OCR models to assess the recognition accuracy of the reconstructed text; for faces, we measure feature similarity between original and reconstructed faces, thereby quantifying face reconstruction fidelity. Our method is highly lightweight, requiring just 2 GB of memory and 4 minutes to complete an evaluation. With our benchmark, we analyze the reconstruction quality of text and faces at various scales across different image tokenizers and VAEs. Our results demonstrate that modern visual tokenizers still struggle to preserve fine-grained features, particularly at smaller scales. Furthermore, we extend this evaluation framework to video, conducting a comprehensive analysis of video tokenizers. Additionally, we find that traditional metrics fail to accurately reflect reconstruction performance for faces and text, while our proposed metrics serve as an effective complement.
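The two metrics the abstract describes can be sketched roughly as follows. This is a minimal illustrative sketch, not TokBench's actual implementation: it assumes OCR transcriptions and face identity embeddings have already been extracted by external models, and the exact-match and cosine-similarity protocols here are simplifying assumptions.

```python
import numpy as np


def text_accuracy(gt_texts, ocr_texts):
    """Fraction of text regions whose OCR transcription of the
    reconstructed image exactly matches the ground-truth string.
    (A proxy for the paper's OCR-based accuracy metric.)"""
    matches = sum(g == p for g, p in zip(gt_texts, ocr_texts))
    return matches / len(gt_texts)


def face_similarity(orig_feats, recon_feats):
    """Mean cosine similarity between identity embeddings of original
    and reconstructed faces. Embeddings (shape [N, D]) would come from
    any off-the-shelf face-recognition model."""
    a = orig_feats / np.linalg.norm(orig_feats, axis=1, keepdims=True)
    b = recon_feats / np.linalg.norm(recon_feats, axis=1, keepdims=True)
    return float(np.mean(np.sum(a * b, axis=1)))
```

A perfect reconstruction yields an OCR accuracy of 1.0 and a mean face similarity of 1.0; compression losses push both scores down, which is what the benchmark measures across tokenizers and scales.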