🤖 AI Summary
This work addresses the critical challenge of misalignment between visual tokenization and the vocabulary space of large language models (LLMs), which hinders autoregressive image generation. To bridge this gap, the authors propose V2Flow, a novel vision tokenizer featuring a vision vocabulary resampler and a masked autoregressive rectified-flow decoder, enabling alignment of visual tokens with LLM vocabularies along both structural and distributional dimensions. V2Flow further introduces soft-classification token representations, a masked Transformer encoder-decoder architecture, and rectified-flow-based velocity field modeling to support variable-length sequence generation. Experiments demonstrate that V2Flow achieves superior reconstruction fidelity compared to state-of-the-art VQ tokenizers and enables high-fidelity image generation across multiple benchmarks. Notably, it is the first method to successfully repurpose open-source LLMs, including LLaMA and Phi-3, for end-to-end autoregressive visual generation without architectural modification.
📝 Abstract
We propose V2Flow, a novel tokenizer that produces discrete visual tokens capable of high-fidelity reconstruction, while ensuring structural and latent-distribution alignment with the vocabulary space of large language models (LLMs). Leveraging this tight visual-vocabulary coupling, V2Flow enables autoregressive visual generation on top of existing LLMs. Our approach formulates visual tokenization as a flow-matching problem, aiming to learn a mapping from a standard normal prior to the continuous image distribution, conditioned on token sequences embedded within the LLM's vocabulary space. The effectiveness of V2Flow stems from two core designs. First, we propose a visual vocabulary resampler, which compresses visual data into compact token sequences, each token represented as a soft categorical distribution over the LLM's vocabulary. This allows seamless integration of visual tokens into existing LLMs for autoregressive visual generation. Second, we present a masked autoregressive rectified-flow decoder, employing a masked Transformer encoder-decoder to refine visual tokens into contextually enriched embeddings. These embeddings then condition a dedicated velocity field for precise reconstruction. Additionally, an autoregressive rectified-flow sampling strategy is incorporated, ensuring flexible sequence lengths while preserving competitive reconstruction quality. Extensive experiments show that V2Flow outperforms mainstream VQ-based tokenizers and facilitates autoregressive visual generation on top of existing LLMs. https://github.com/zhangguiwei610/V2Flow
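The two core designs in the abstract can be sketched in a few lines of PyTorch. This is an illustrative reading of the text, not the paper's implementation: the module and function names (`SoftVocabularyResampler`, `rectified_flow_loss`), layer choices, and dimensions are assumptions. The resampler maps visual features to a softmax distribution over the LLM vocabulary and embeds each token as the expectation over the LLM's embedding table; the loss is the standard rectified-flow objective of regressing the straight-line velocity from a standard normal prior to the data.

```python
# Hypothetical sketch of V2Flow's two core ideas; names and shapes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftVocabularyResampler(nn.Module):
    """Compress visual features into tokens, each a soft categorical
    distribution over an LLM vocabulary (illustrative, not the paper's API)."""
    def __init__(self, feat_dim, vocab_size, num_tokens, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_tokens, feat_dim))
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.to_logits = nn.Linear(feat_dim, vocab_size)

    def forward(self, visual_feats, llm_embedding):
        # visual_feats: (B, N, feat_dim); llm_embedding: (V, d_llm)
        B = visual_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        resampled, _ = self.attn(q, visual_feats, visual_feats)
        probs = F.softmax(self.to_logits(resampled), dim=-1)  # (B, T, V)
        # Soft token embedding = expectation over the LLM's embedding table,
        # so visual tokens live in the LLM's vocabulary space.
        return probs @ llm_embedding                          # (B, T, d_llm)

def rectified_flow_loss(velocity_net, x1, cond):
    """Flow-matching objective: regress the constant velocity x1 - x0
    along the straight path from a standard normal prior x0 to data x1."""
    x0 = torch.randn_like(x1)              # standard normal prior
    t = torch.rand(x1.size(0), 1, 1)       # uniform time in [0, 1]
    xt = (1 - t) * x0 + t * x1             # linear interpolation
    target = x1 - x0                       # straight-line velocity
    return F.mse_loss(velocity_net(xt, t, cond), target)
```

In this reading, the resampler output would serve as the conditioning `cond` for the velocity field, tying reconstruction directly to tokens already aligned with the LLM's vocabulary.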