Superpixel Tokenization for Vision Transformers: Preserving Semantic Integrity in Visual Tokens

📅 2024-12-06
🏛️ arXiv.org
📈 Citations: 1
✨ Influential: 0
📄 PDF
🤖 AI Summary
Standard grid-based patching in Vision Transformers (ViTs) often yields tokens containing mixed semantic content, undermining representation consistency. To address this, the paper proposes a superpixel-driven, semantically consistent tokenization method, the first to integrate superpixels into ViT token generation. The approach is a two-stage pipeline: pre-aggregation feature extraction followed by superpixel-aware aggregation. It leverages SLIC superpixel segmentation, region-wise feature pre-aggregation, deformable attention adaptation, and token-level semantic alignment to resolve the compatibility challenge between irregular superpixel regions and the Transformer architecture. The method is backbone-agnostic and plug-and-play, requiring no modifications to the base network. Extensive experiments on ImageNet, CIFAR-100, and adversarial robustness benchmarks demonstrate significant improvements in both accuracy and generalization, empirically validating the performance gains conferred by semantically pure tokens in ViTs.
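
A minimal sketch of the tokenization idea, assuming SLIC from scikit-image for segmentation, a small convolutional stem as a stand-in for the pre-aggregation feature extractor, and simple per-superpixel average pooling for the aggregation step (the paper's actual components, such as the deformable attention adaptation, are not reproduced here):

```python
import numpy as np
import torch
import torch.nn as nn
from skimage.segmentation import slic

def superpixel_tokens(img: np.ndarray, n_segments: int = 196, dim: int = 64) -> torch.Tensor:
    """img: H x W x 3 float array in [0, 1]; returns (n_superpixels, dim) tokens."""
    # 1) Segment the image into roughly `n_segments` superpixels with SLIC.
    labels = slic(img, n_segments=n_segments, compactness=10, start_label=0)
    n_tokens = int(labels.max()) + 1

    # 2) Pre-aggregation: dense per-pixel features from a small conv stem
    #    (a placeholder for the paper's pre-aggregate extraction stage).
    stem = nn.Conv2d(3, dim, kernel_size=3, padding=1)
    x = torch.from_numpy(img).float().permute(2, 0, 1).unsqueeze(0)   # 1 x 3 x H x W
    feats = stem(x).squeeze(0).permute(1, 2, 0).reshape(-1, dim)      # (H*W, dim)

    # 3) Superpixel-aware aggregation: average the pixel features that fall
    #    inside each superpixel, yielding one token per visual region.
    idx = torch.from_numpy(labels.reshape(-1)).long()                 # (H*W,)
    tokens = torch.zeros(n_tokens, dim).index_add_(0, idx, feats)
    counts = torch.bincount(idx, minlength=n_tokens).clamp(min=1)
    return tokens / counts.unsqueeze(1).float()
```

For a 224x224 image with n_segments=196, this produces roughly as many tokens as standard 16x16 grid patching, but each token covers a contiguous, visually homogeneous region rather than an arbitrary square.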

πŸ“ Abstract
Transformers, a groundbreaking architecture proposed for Natural Language Processing (NLP), have also achieved remarkable success in Computer Vision. A cornerstone of their success lies in the attention mechanism, which models relationships among tokens. While the tokenization process in NLP inherently ensures that a single token does not contain multiple semantics, the tokenization of Vision Transformer (ViT) utilizes tokens from uniformly partitioned square image patches, which may result in an arbitrary mixing of visual concepts in a token. In this work, we propose to substitute the grid-based tokenization in ViT with superpixel tokenization, which employs superpixels to generate a token that encapsulates a sole visual concept. Unfortunately, the diverse shapes, sizes, and locations of superpixels make integrating superpixels into ViT tokenization rather challenging. Our tokenization pipeline, comprised of pre-aggregate extraction and superpixel-aware aggregation, overcomes the challenges that arise in superpixel tokenization. Extensive experiments demonstrate that our approach, which exhibits strong compatibility with existing frameworks, enhances the accuracy and robustness of ViT on various downstream tasks.
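
The abstract's central difficulty is that superpixels vary in shape, size, and location, unlike fixed grid patches. The sketch below illustrates one way such variable-count tokens can still feed a standard, unmodified Transformer encoder; the linear projection and the centroid-based position encoding are illustrative assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn

class SuperpixelViTStub(nn.Module):
    """Plain Transformer encoder fed by superpixel tokens (illustrative only)."""
    def __init__(self, feat_dim: int = 64, embed_dim: int = 192, depth: int = 4):
        super().__init__()
        self.proj = nn.Linear(feat_dim, embed_dim)          # embed token content
        self.pos = nn.Linear(2, embed_dim)                   # embed (y, x) centroids
        self.cls = nn.Parameter(torch.zeros(1, 1, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, tokens: torch.Tensor, centroids: torch.Tensor) -> torch.Tensor:
        # tokens: (n_superpixels, feat_dim)
        # centroids: (n_superpixels, 2), normalized superpixel centers in [0, 1]
        x = self.proj(tokens) + self.pos(centroids)          # content + position
        x = torch.cat([self.cls, x.unsqueeze(0)], dim=1)     # prepend [CLS]
        return self.encoder(x)[:, 0]                         # [CLS] representation
```

Since only the token construction changes, a ViT-style backbone can in principle sit behind the projection unchanged, which is the sense in which the tokenization is compatible with existing frameworks.
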
Problem

Research questions and friction points this paper is trying to address.

Improving ViT tokenization by using superpixels
Ensuring each token encapsulates a single visual concept
Overcoming the challenges of integrating superpixels of diverse shapes, sizes, and locations into ViT
Innovation

Methods, ideas, or system contributions that make the work stand out.

Superpixel tokenization replaces grid-based tokenization
Pre-aggregate extraction enables superpixel-aware aggregation
Enhances ViT accuracy and robustness in downstream tasks
Authors

Jaihyun Lew
Seoul National University
Computer Vision
Soohyuk Jang
Seoul National University
Machine Learning
Jaehoon Lee
Interdisciplinary Program in AI, Seoul National University
Seung-Kwun Yoo
Department of Electrical and Computer Engineering, Seoul National University
Eunji Kim
Department of Electrical and Computer Engineering, Seoul National University
Saehyung Lee
Seoul National University, Electrical and Computer Engineering
deep learning, machine learning
J. Mok
Department of Electrical and Computer Engineering, Seoul National University
Siwon Kim
Amazon
Sungroh Yoon
Professor, Electrical and Computer Engineering & Artificial Intelligence, Seoul National University
AI, deep learning, machine learning, on-device AI, bioinformatics