FuseLIP: Multimodal Embeddings via Early Fusion of Discrete Tokens

šŸ“… 2025-06-03
šŸ“ˆ Citations: 0
✨ Influential: 0
šŸ¤– AI Summary
Contrastive language–image pretraining typically relies on dual-encoder architectures, which cannot natively encode a joint image–text input into a single unified representation. FuseLIP addresses this with a discrete-token early-fusion design for vision–language embedding: images are encoded by a discrete image tokenizer and mapped, alongside text tokens, into a shared extended vocabulary, so that a single-stream Transformer processes both modalities jointly with cross-modal interaction at every layer. This removes the dual-encoder-plus-late-fusion pipeline and the feature misalignment and information loss that fusion modules can introduce. The authors collect new datasets for multimodal pretraining and evaluation and train with contrastive objectives. Experiments show that FuseLIP outperforms late-fusion baselines on multimodal embedding tasks, including VQA and text-guided image transformation retrieval, while remaining comparable on unimodal tasks.
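The shared-vocabulary early fusion described above can be sketched in a few lines. All vocabulary sizes and special-token ids below are illustrative assumptions, not the paper's actual values:

```python
# Sketch of early fusion over an extended vocabulary: discrete image-
# tokenizer codes are offset past the text vocabulary, then concatenated
# with text tokens so one single-stream Transformer attends across both.
# Sizes and special tokens are hypothetical.

TEXT_VOCAB_SIZE = 32000   # assumed text tokenizer vocabulary size
IMAGE_VOCAB_SIZE = 8192   # assumed image-tokenizer codebook size
BOS, SEP = 0, 1           # assumed special tokens in the shared vocabulary

def fuse_tokens(text_ids, image_codes):
    """Map image codebook indices into the extended vocabulary and
    concatenate them with text token ids into one input sequence."""
    assert all(0 <= c < IMAGE_VOCAB_SIZE for c in image_codes)
    image_ids = [c + TEXT_VOCAB_SIZE for c in image_codes]
    return [BOS] + list(text_ids) + [SEP] + image_ids
```

Because text and image positions live in one sequence, self-attention can mix the modalities at every layer, rather than only after two separate encoders have finished.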


šŸ“ Abstract
Contrastive language-image pre-training aligns the features of text-image pairs in a common latent space via distinct encoders for each modality. While this approach achieves impressive performance in several zero-shot tasks, it cannot natively handle multimodal inputs, i.e., encoding image and text into a single feature vector. As a remedy, it is common practice to use additional modules to merge the features extracted by the unimodal encoders. In this work, we present FuseLIP, an alternative architecture for multimodal embedding. Leveraging recent progress in discrete image tokenizers, we propose to use a single transformer model which operates on an extended vocabulary of text and image tokens. This early fusion approach allows the different modalities to interact at each depth of encoding and obtain richer representations compared to common late fusion. We collect new datasets for multimodal pre-training and evaluation, designing challenging tasks for multimodal encoder models. We show that FuseLIP outperforms other approaches in multimodal embedding tasks such as VQA and text-guided image transformation retrieval, while being comparable to baselines on unimodal tasks.
Problem

Research questions and friction points this paper is trying to address.

Encode image and text jointly into a single feature vector
Enable cross-modal interaction beyond late-fusion feature merging
Improve performance on multimodal tasks such as VQA and image retrieval
Innovation

Methods, ideas, or system contributions that make the work stand out.

Early fusion of text and image tokens
Single transformer for multimodal encoding
Extended shared vocabulary covering both text and image tokens
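The contrastive pretraining objective mentioned in the summary is commonly instantiated as a symmetric InfoNCE (CLIP-style) loss over a batch of paired embeddings; whether FuseLIP uses exactly this form and this temperature is an assumption. A toy pure-Python version:

```python
import math

def normalize(v):
    # Scale a vector to unit length so dot products are cosine similarities.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def clip_style_loss(emb_a, emb_b, temperature=0.07):
    """Symmetric InfoNCE loss (assumed objective): embedding i in batch A
    should match embedding i in batch B, and vice versa."""
    a = [normalize(v) for v in emb_a]
    b = [normalize(v) for v in emb_b]
    # Pairwise similarity logits, scaled by an assumed temperature.
    logits = [[sum(x * y for x, y in zip(u, w)) / temperature for w in b]
              for u in a]

    def cross_entropy(rows):
        # Mean softmax cross-entropy with the diagonal as the target class.
        loss = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            loss += log_z - row[i]
        return loss / len(rows)

    cols = [list(c) for c in zip(*logits)]       # transpose: B-to-A direction
    return 0.5 * (cross_entropy(logits) + cross_entropy(cols))
```

Correctly paired batches yield a lower loss than mismatched ones, which is what drives the fused embeddings of matching inputs together during pretraining.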