AI Summary
To address the high computational cost and absence of explicit memory mechanisms in Vision Transformers (ViTs), this paper proposes the Vision Token Turing Machine (ViTTM), the first Token Turing Machine architecture tailored for non-sequential vision tasks such as image classification and semantic segmentation. Its core innovation is a decoupled dual-token design: process tokens specialize in feature extraction, while memory tokens support cross-layer read/write operations and information reuse. Combined with a learnable memory module and a lightweight cross-layer gating mechanism, ViTTM increases model capacity and generalization without lengthening the processed sequence. Experiments show that ViTTM-B achieves 82.9% Top-1 accuracy on ImageNet-1K with only 234.1 ms inference latency, 56% faster than ViT-B. On ADE20K semantic segmentation, it attains 45.17 mIoU at 26.8 FPS, a 94% speedup over the baseline.
Abstract
We propose Vision Token Turing Machines (ViTTM), an efficient, low-latency, memory-augmented Vision Transformer (ViT). Our approach builds on Neural Turing Machines and Token Turing Machines, which were applied to NLP and sequential visual understanding tasks. ViTTMs are designed for non-sequential computer vision tasks such as image classification and segmentation. Our model creates two sets of tokens: process tokens and memory tokens. Process tokens pass through the encoder blocks and read from and write to the memory tokens at each block, allowing them to store and retrieve information from memory. By ensuring that there are fewer process tokens than memory tokens, we reduce the inference time of the network while maintaining its accuracy. On ImageNet-1K, the state-of-the-art ViT-B has a median latency of 529.5 ms and 81.0% accuracy, while our ViTTM-B is 56% faster (234.1 ms) with 2.4 times fewer FLOPs and an accuracy of 82.9%. On ADE20K semantic segmentation, ViT-B achieves 45.65 mIoU at 13.8 frames per second (FPS), whereas our ViTTM-B achieves 45.17 mIoU at 26.8 FPS (+94%).
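To make the read/write idea concrete, here is a minimal NumPy sketch of one encoder step with a small process-token set and a larger memory-token set. The use of plain scaled dot-product cross-attention for the read and write operators, and the identity placeholder for the encoder block, are illustrative assumptions, not the paper's exact design; only the process tokens go through the (expensive) encoder, which is where the latency savings come from.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, keys_values, d):
    # Queries gather information from keys_values via scaled dot-product attention.
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores) @ keys_values

def vittm_block(process, memory, d):
    # Read: process tokens retrieve information from the memory tokens.
    process = process + cross_attend(process, memory, d)
    # Process: placeholder for the ViT encoder block (self-attention + MLP),
    # which in the real model operates on the few process tokens only.
    # Write: memory tokens absorb information from the processed tokens.
    memory = memory + cross_attend(memory, process, d)
    return process, memory

rng = np.random.default_rng(0)
d = 16
process = rng.standard_normal((8, d))   # few process tokens -> cheap encoder pass
memory = rng.standard_normal((64, d))   # many memory tokens carry state across layers
process, memory = vittm_block(process, memory, d)
print(process.shape, memory.shape)  # (8, 16) (64, 16)
```

Because the encoder cost scales with the number of process tokens (8 here) rather than the full memory size (64), shrinking the process set reduces FLOPs while the memory tokens preserve capacity across layers.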