Understanding the Effect of using Semantically Meaningful Tokens for Visual Representation Learning

📅 2024-05-26
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision Transformers (ViTs) rely on uniform image patching, yielding tokens with limited semantic meaning and compositional structure. Method: We propose a semantics-driven visual tokenization framework that replaces fixed-size patches with “tangible tokens” derived from instance segmentation masks and “intangible tokens” extracted from scene graphs to encode relationships and actions. We introduce an additive attention mechanism to explicitly model structural and semantic dependencies among tokens, and design an end-to-end vision-language pretraining framework enabling fine-grained alignment between visual token embeddings and textual caption embeddings. Contribution/Results: This work is the first to systematically incorporate interpretable, semantically grounded visual tokens. On COCO image-text retrieval, it improves text-to-image and image-to-text recall by 47% and 44%, respectively. On compositional reasoning benchmarks—ARO and Winoground—it achieves gains of 18% and 10%, significantly enhancing semantic understanding and compositional generalization.
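The alignment step of the pretraining framework can be sketched with a standard CLIP-style symmetric contrastive loss between pooled visual-token embeddings and caption embeddings. This is a minimal stand-in, not the paper's exact objective; the function name and temperature value are assumptions, and the paper's "fine-grained" alignment may operate at a finer granularity than whole-image/whole-caption pairs.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def contrastive_alignment_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched (image, caption) pairs.

    A generic CLIP-style stand-in for the paper's alignment objective;
    the temperature and pooling scheme here are assumptions.
    """
    # L2-normalize both embedding sets so logits are scaled cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature      # (B, B); matched pairs on the diagonal
    idx = np.arange(len(img))
    p_i2t = softmax(logits, axis=1)         # image-to-text retrieval direction
    p_t2i = softmax(logits, axis=0)         # text-to-image retrieval direction
    return -(np.log(p_i2t[idx, idx]).mean() + np.log(p_t2i[idx, idx]).mean()) / 2

rng = np.random.default_rng(1)
B, D = 4, 8
v = rng.standard_normal((B, D))
loss = contrastive_alignment_loss(v, v)     # identical pairs -> near-minimal loss
print(loss)
```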

📝 Abstract
Vision transformers have established a precedent of patchifying images into uniformly-sized chunks before processing. We hypothesize that this design choice may limit models in learning comprehensive and compositional representations from visual data. This paper explores the notion of providing semantically-meaningful visual tokens to transformer encoders within a vision-language pre-training framework. Leveraging off-the-shelf segmentation and scene-graph models, we extract representations of instance segmentation masks (referred to as tangible tokens) and relationships and actions (referred to as intangible tokens). Subsequently, we pre-train a vision-side transformer by incorporating these newly extracted tokens and aligning the resultant embeddings with caption embeddings from a text-side encoder. To capture the structural and semantic relationships among visual tokens, we introduce additive attention weights, which are used to compute self-attention scores. Our experiments on COCO demonstrate notable improvements over ViTs in learned representation quality across text-to-image (+47%) and image-to-text retrieval (+44%) tasks. Furthermore, we showcase the advantages on compositionality benchmarks such as ARO (+18%) and Winoground (+10%).
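The "additive attention weights" the abstract describes can be sketched as a learned bias matrix added to the scaled dot-product logits before the softmax. This is a minimal sketch under that assumption; the paper's exact parameterization of the additive weights (e.g. how they are derived from the scene graph) may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def additive_self_attention(Q, K, V, A):
    """Scaled dot-product self-attention with an additive bias A.

    A is an (n_tokens x n_tokens) matrix assumed here to encode structural
    and semantic relations among visual tokens (e.g. scene-graph adjacency);
    this parameterization is an illustrative assumption.
    """
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d) + A   # additive weights enter before softmax
    return softmax(logits) @ V

rng = np.random.default_rng(0)
n, d = 5, 8
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
A = np.zeros((n, n))                    # zero bias recovers ordinary attention
out = additive_self_attention(Q, K, V, A)
print(out.shape)  # (5, 8)
```

With `A = 0` this reduces exactly to vanilla self-attention, so the bias acts as a structural prior layered on top of the usual content-based scores.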
Problem

Research questions and friction points this paper is trying to address.

Improving visual representation learning with semantic tokens
Enhancing vision transformers via meaningful segmentation masks
Boosting compositionality in vision-language pre-training frameworks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses semantically-meaningful visual tokens
Leverages segmentation and scene-graph models
Introduces additive attention weights
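The first two points above can be illustrated with a sketch of how a "tangible token" might be formed: average-pool a dense backbone feature map over each instance mask produced by an off-the-shelf segmentation model. The helper name and pooling choice are hypothetical; the paper may use a different aggregation.

```python
import numpy as np

def tangible_tokens(feature_map, masks):
    """Pool a dense feature map over instance masks, one token per object.

    feature_map: (H, W, D) backbone features
    masks:       (N, H, W) boolean instance segmentation masks
    returns:     (N, D) token embeddings

    Hypothetical helper: mean pooling is an assumption, not the paper's
    confirmed aggregation scheme.
    """
    D = feature_map.shape[-1]
    tokens = []
    for m in masks:
        # mean over the pixels inside the mask; empty masks yield a zero token
        pooled = feature_map[m].mean(axis=0) if m.sum() else np.zeros(D)
        tokens.append(pooled)
    return np.stack(tokens)

H, W, D = 4, 4, 6
feat = np.ones((H, W, D))
masks = np.zeros((2, H, W), dtype=bool)
masks[0, :2, :2] = True                 # first object occupies the top-left
masks[1, 2:, 2:] = True                 # second object occupies the bottom-right
toks = tangible_tokens(feat, masks)
print(toks.shape)  # (2, 6)
```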