From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities

📅 2024-10-03
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal large language models (MLLMs) often suffer from insufficient alignment between visual and textual modalities. To address this, the authors propose an image tokenizer that adapts the Byte-Pair Encoding (BPE) paradigm to vision: rather than relying on a separate visual encoder to produce image representations, it applies BPE directly to quantized visual features, incorporating structural prior information into the image tokens and yielding a token space natively compatible with text tokens. Theoretical analysis and experiments indicate that this tokenization improves MLLMs' multimodal understanding even with limited training data. The resulting model, Being-VL-0, performs strongly across multiple benchmarks and shows promising scalability.
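To make the core idea concrete, here is a minimal sketch of standard BPE merging applied to 1-D sequences of quantized visual codebook indices. This is a toy illustration of the general BPE paradigm the summary describes, not the paper's actual tokenizer (which additionally injects 2-D structural priors); the function name, vocabulary layout, and merge policy are assumptions for illustration.

```python
from collections import Counter

def train_visual_bpe(sequences, num_merges, base_vocab):
    """Toy BPE over sequences of quantized visual token IDs.

    sequences : lists of integer codebook indices (e.g. from a VQ model)
    num_merges: how many new merged tokens to create
    base_vocab: size of the base codebook; merged tokens get IDs above it
    """
    merges = {}
    next_id = base_vocab
    seqs = [list(s) for s in sequences]
    for _ in range(num_merges):
        # count all adjacent token pairs across the corpus
        pairs = Counter()
        for s in seqs:
            pairs.update(zip(s, s[1:]))
        if not pairs:
            break
        best, _ = pairs.most_common(1)[0]
        merges[best] = next_id
        # replace every occurrence of the best pair with the new ID
        for i, s in enumerate(seqs):
            out, j = [], 0
            while j < len(s):
                if j + 1 < len(s) and (s[j], s[j + 1]) == best:
                    out.append(next_id)
                    j += 2
                else:
                    out.append(s[j])
                    j += 1
            seqs[i] = out
        next_id += 1
    return merges, seqs
```

For example, with a base codebook of size 8 and the sequences `[1, 2, 1, 2, 3]` and `[1, 2, 4]`, the most frequent pair `(1, 2)` is merged into the new token `8`, shortening the sequences to `[8, 8, 3]` and `[8, 4]` — the same frequency-driven compression that text BPE performs on byte sequences.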

📝 Abstract
Multimodal Large Language Models have made significant strides in integrating visual and textual information, yet they often struggle with effectively aligning these modalities. We introduce a novel image tokenizer that bridges this gap by applying the principle of Byte-Pair Encoding (BPE) to visual data. Unlike conventional approaches that rely on separate visual encoders, our method directly incorporates structural prior information into image tokens, mirroring the successful tokenization strategies used in text-only Large Language Models. This innovative approach enables Transformer models to more effectively learn and reason across modalities. Through theoretical analysis and extensive experiments, we demonstrate that our BPE Image Tokenizer significantly enhances MLLMs' multimodal understanding capabilities, even with limited training data. Leveraging this method, we develop Being-VL-0, a model that demonstrates superior performance across various benchmarks and shows promising scalability, potentially paving the way for more efficient and capable multimodal foundation models.
Problem

Research questions and friction points this paper is trying to address.

Aligning visual and textual modalities effectively in MLLMs.
Incorporating structural prior information into image tokens.
Enhancing multimodal understanding with limited training data.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Byte-Pair Encoding applied to visual data
Structural prior integrated into image tokens
Enhanced multimodal understanding with limited data
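One way to read "structural prior integrated into image tokens" is that merge candidates are scored not only by raw co-occurrence frequency but also by how tokens sit next to each other on the 2-D image grid. The sketch below is a hypothetical illustration of such prior-weighted scoring, not the paper's actual formulation; the direction weights and the function itself are assumptions.

```python
from collections import Counter

def score_merge_candidates(grids, w_h=1.0, w_v=1.0):
    """Score token pairs by weighted 2-D adjacency counts.

    grids: list of 2-D grids (lists of rows) of quantized token IDs.
    w_h, w_v: weights expressing a structural prior over horizontal
    vs. vertical adjacency when ranking merge candidates.
    """
    horiz, vert = Counter(), Counter()
    for g in grids:
        for r, row in enumerate(g):
            for c, tok in enumerate(row):
                if c + 1 < len(row):          # right neighbour
                    horiz[(tok, row[c + 1])] += 1
                if r + 1 < len(g):            # neighbour below
                    vert[(tok, g[r + 1][c])] += 1
    scores = Counter()
    for pair, n in horiz.items():
        scores[pair] += w_h * n
    for pair, n in vert.items():
        scores[pair] += w_v * n
    return scores
```

Under this reading, a pair that recurs in a geometrically consistent arrangement (say, always horizontally, as in texture patterns) would outrank an equally frequent but spatially scattered pair, which is one plausible mechanism for enforcing the geometric consistency the summary mentions.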