End-to-End Vision Tokenizer Tuning

📅 2025-05-15
🤖 AI Summary
Existing vision tokenization methods decouple tokenizer training from downstream tasks, causing a semantic misalignment between visual tokens and task objectives that becomes a representational bottleneck for multimodal understanding and generation. This paper proposes ETT, an end-to-end framework that jointly optimizes a vision tokenizer with multimodal autoregressive tasks. Rather than passing only discrete indices from a frozen tokenizer to the language model, ETT feeds the continuous codebook embeddings, making the tokenizer differentiable with respect to downstream objectives, and trains it with a joint reconstruction and caption loss. Crucially, it requires no modification to the large language model's vocabulary or architecture, preserving native reconstruction capability while enhancing task-specific adaptation. On multimodal understanding and visual generation benchmarks, the method outperforms frozen-tokenizer baselines by 2–6%, while fully retaining image reconstruction performance.

📝 Abstract
Existing vision tokenization isolates the optimization of vision tokenizers from downstream training, implicitly assuming that the visual tokens generalize well across tasks, e.g., image generation and visual question answering. A vision tokenizer optimized for low-level reconstruction is agnostic to downstream tasks that require varied representations and semantics. This decoupled paradigm introduces a critical misalignment: the loss of the vision tokenization can become the representation bottleneck for target tasks. For example, errors in tokenizing text in a given image lead to poor results when recognizing or generating it. To address this, we propose ETT, an end-to-end vision tokenizer tuning approach that enables joint optimization between vision tokenization and target autoregressive tasks. Unlike prior autoregressive models that use only discrete indices from a frozen vision tokenizer, ETT leverages the visual embeddings of the tokenizer codebook and optimizes the vision tokenizer end-to-end with both reconstruction and caption objectives. ETT integrates seamlessly into existing training pipelines with minimal architectural modifications, without the need to adjust the original codebooks or architectures of the employed large language models. Extensive experiments demonstrate that end-to-end vision tokenizer tuning unlocks significant performance gains, i.e., 2–6% on multimodal understanding and visual generation tasks compared to frozen-tokenizer baselines, while preserving the original reconstruction capability. We hope this simple and strong method can empower multimodal foundation models beyond image generation and understanding.
Problem

Research questions and friction points this paper is trying to address.

Vision tokenizers lack task-specific optimization alignment
Decoupled tokenization causes representation bottlenecks in tasks
Frozen tokenizers limit multimodal performance gains
Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-end joint vision tokenizer and task optimization
Leverages visual embeddings for tokenizer codebook tuning
Seamless integration with minimal architecture modifications
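The core idea above can be sketched in a few lines of PyTorch. This is an illustrative toy, not the paper's implementation: all module names, sizes, and the stand-in "LLM" head are assumptions. It shows how feeding the continuous codebook embeddings (with a straight-through estimator) lets gradients from both a reconstruction loss and a caption loss flow back into the tokenizer encoder, which is exactly what a frozen, indices-only tokenizer prevents.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyTokenizer(nn.Module):
    """Minimal VQ-style tokenizer sketch (hypothetical, for illustration)."""
    def __init__(self, in_dim=8, dim=16, codebook_size=32):
        super().__init__()
        self.encoder = nn.Linear(in_dim, dim)
        self.codebook = nn.Embedding(codebook_size, dim)
        self.decoder = nn.Linear(dim, in_dim)

    def forward(self, x):
        z = self.encoder(x)                           # continuous latents
        dists = torch.cdist(z, self.codebook.weight)  # distance to each code
        idx = dists.argmin(dim=-1)                    # discrete token indices
        q = self.codebook(idx)                        # codebook embeddings
        # Straight-through: forward uses q, backward routes gradients to z,
        # so downstream losses can update the encoder despite quantization.
        q = z + (q - z).detach()
        return q, self.decoder(q), idx

torch.manual_seed(0)
tok = ToyTokenizer()
llm_head = nn.Linear(16, 100)            # stand-in for the autoregressive LLM

x = torch.randn(4, 8)                    # toy "image" features
captions = torch.randint(0, 100, (4,))   # toy caption token targets

q, recon, _ = tok(x)
# Joint objective: reconstruction + caption, as in end-to-end tuning.
loss = F.mse_loss(recon, x) + F.cross_entropy(llm_head(q), captions)
loss.backward()
```

After `backward()`, `tok.encoder.weight.grad` is populated by both objectives; with a frozen tokenizer (or indices-only input), the caption loss could not reach it at all.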
Wenxuan Wang
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; Beijing Academy of Artificial Intelligence
Fan Zhang
Beijing Academy of Artificial Intelligence
Yufeng Cui
Beijing Academy of Artificial Intelligence
Generative AI · Multimodal AI · Foundation Models
Haiwen Diao
Nanyang Technological University
Computer Vision · Vision-and-Language · Transfer Learning · Multimodal LLM
Zhuoyan Luo
Tsinghua University
AI
Huchuan Lu
Dalian University of Technology
Jing Liu
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Xinlong Wang
Beijing Academy of Artificial Intelligence