Subobject-level Image Tokenization

📅 2024-02-22
🏛️ arXiv.org
📈 Citations: 5
Influential: 0
🤖 AI Summary
Existing patch-based image tokenization methods neglect the morphological structure of the visual world, which limits how effectively and efficiently models learn image understanding. To address this, the authors propose subobject-level adaptive token segmentation, inspired by subword tokenization, and explore several tokenizers: superpixel, the Segment Anything Model (SAM), and their proposed Efficient and PanOptiC (EPOC) tokenizer. EPOC pairs lightweight boundary detection with watershed segmentation, which guarantees that no pixel is left unsegmented, and it supports both object- and part-level representations. Intrinsic evaluation on five datasets shows that EPOC's segmentation aligns well with human annotations, produces more monosemantic tokens, and offers substantial efficiency advantages. For extrinsic evaluation, a token embedding that handles arbitrary-shaped tokens is used to train vision-language models (VLMs) with different tokenizers. Subobject tokenization converges faster and generalizes better while reducing the visual token count by up to 60%.

📝 Abstract
Patch-based image tokenization ignores the morphology of the visual world, limiting effective and efficient learning of image understanding. Inspired by subword tokenization, we introduce subobject-level adaptive token segmentation and explore several approaches, including superpixel, SAM, and a proposed Efficient and PanOptiC (EPOC) image tokenizer. Our EPOC combines boundary detection -- a simple task that can be handled well by a compact model -- with watershed segmentation, which inherently guarantees no pixels are left unsegmented. Intrinsic evaluations across 5 datasets demonstrate that EPOC's segmentation aligns well with human annotations of both object- and part-level visual morphology, producing more monosemantic tokens and offering substantial efficiency advantages. For extrinsic evaluation, we designed a token embedding that handles arbitrary-shaped tokens, and trained VLMs with different tokenizers on 4 datasets of object recognition and detailed captioning. The results reveal that subobject tokenization enables faster convergence and better generalization while using fewer visual tokens.
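To illustrate why pairing a boundary map with watershed flooding leaves no pixel unsegmented, here is a minimal sketch in Python. It is not the EPOC implementation: the boundary map is assumed to come from some boundary detector (not shown), and the function names `label_seeds` and `watershed_tokens`, the `threshold` parameter, and the 4-connectivity are all hypothetical choices made for illustration.

```python
import heapq
import numpy as np

NEIGHBORS = ((-1, 0), (1, 0), (0, -1), (0, 1))  # 4-connectivity (assumption)


def label_seeds(boundary, threshold):
    """Label 4-connected regions whose boundary strength is below threshold."""
    h, w = boundary.shape
    labels = np.zeros((h, w), dtype=np.int32)  # 0 means "not yet assigned"
    count = 0
    for y in range(h):
        for x in range(w):
            if boundary[y, x] >= threshold or labels[y, x] != 0:
                continue
            count += 1
            labels[y, x] = count
            stack = [(y, x)]
            while stack:  # flood fill within the low-boundary region
                cy, cx = stack.pop()
                for dy, dx in NEIGHBORS:
                    ny, nx = cy + dy, cx + dx
                    if (0 <= ny < h and 0 <= nx < w
                            and labels[ny, nx] == 0
                            and boundary[ny, nx] < threshold):
                        labels[ny, nx] = count
                        stack.append((ny, nx))
    return labels, count


def watershed_tokens(boundary, threshold=0.5):
    """Assign every pixel to a segment by flooding from low-boundary seeds."""
    boundary = np.asarray(boundary, dtype=np.float64)
    h, w = boundary.shape
    labels, count = label_seeds(boundary, threshold)
    if count == 0:
        # No pixel is below the threshold: seed at the weakest boundary pixel.
        y, x = np.unravel_index(np.argmin(boundary), boundary.shape)
        labels[y, x] = 1
    heap = [(boundary[y, x], int(y), int(x)) for y, x in zip(*np.nonzero(labels))]
    heapq.heapify(heap)
    while heap:
        # Expand from the pixel with the lowest boundary strength first.
        _, y, x = heapq.heappop(heap)
        for dy, dx in NEIGHBORS:
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and labels[ny, nx] == 0:
                labels[ny, nx] = labels[y, x]
                heapq.heappush(heap, (boundary[ny, nx], ny, nx))
    return labels
```

Because flooding only stops when the queue is empty and every pixel is reachable from some seed, the returned label map contains no zeros, including on the boundary ridges themselves. Each distinct label would then correspond to one visual token.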
Problem

Research questions and friction points this paper is trying to address.

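Patch-based image tokenization ignores the morphology of the visual world, limiting effective and efficient learning of image understanding.
A subobject-level tokenizer must be efficient and must leave no pixels unsegmented while aligning with object- and part-level visual morphology.
Vision-language models need a way to embed arbitrary-shaped tokens so that different tokenizers can be compared on convergence, generalization, and token count.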
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces subobject-level adaptive token segmentation, inspired by subword tokenization.
Proposes EPOC, which combines boundary detection with watershed segmentation.
Designs a token embedding that handles arbitrary-shaped tokens.
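The summary does not spell out how the arbitrary-shaped token embedding works, so the sketch below shows only one simple way such an embedding could be built: average a feature map over each segment mask and append the segment's normalized bounding box as a position cue. This is an assumed stand-in, not the paper's design, and the function name `embed_segments`, the feature layout `(H, W, C)`, and the box encoding are hypothetical.

```python
import numpy as np


def embed_segments(features, labels):
    """Pool per-pixel features into one vector per irregular segment.

    features: (H, W, C) array of per-pixel features (assumed given).
    labels:   (H, W) integer segment ids, one id per pixel.
    Returns (ids, tokens) where tokens has shape (num_segments, C + 4).
    """
    features = np.asarray(features, dtype=np.float64)
    labels = np.asarray(labels)
    h, w, _ = features.shape
    ids = np.unique(labels)
    tokens = []
    for seg_id in ids:
        mask = labels == seg_id
        # Content: mean feature over the pixels inside the segment mask.
        content = features[mask].mean(axis=0)
        # Position: bounding box normalized to [0, 1] as (x0, y0, x1, y1).
        ys, xs = np.nonzero(mask)
        box = np.array([xs.min() / w, ys.min() / h,
                        (xs.max() + 1) / w, (ys.max() + 1) / h])
        tokens.append(np.concatenate([content, box]))
    return ids, np.stack(tokens)
```

The number of output tokens equals the number of segments rather than a fixed patch grid, which is how a segmentation-based tokenizer can end up using fewer visual tokens on simple images.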