ViTCoD: Vision Transformer Acceleration via Dedicated Algorithm and Accelerator Co-Design

📅 2022-10-18
🏛️ International Symposium on High-Performance Computer Architecture
📈 Citations: 57
Influential: 4
🤖 AI Summary
Deploying Vision Transformers (ViTs) on resource-constrained platforms faces critical bottlenecks: prohibitively high computational overhead from self-attention and excessive off-chip data movement, compounded by the inapplicability of existing NLP accelerators. This work proposes a hardware–software co-design acceleration framework tailored to ViT characteristics. First, it introduces *fixed-pattern sparsification* and *polarized attention*, leveraging ViT’s stable token count and static prunability of attention maps. Second, it employs a lightweight learnable autoencoder to replace costly off-chip data transfers with low-overhead on-chip computation. Third, it designs a dedicated accelerator supporting unified scheduling of both dense and sparse attention, integrated with customized on-chip encode/decode engines. At 90% attention sparsity, the framework achieves 235.3×, 142.9×, 86.0×, 10.1×, and 6.8× speedup over CPU, EdgeGPU, GPU, SpAtten, and Sanger, respectively—significantly alleviating the data-movement–dominated energy-efficiency bottleneck.
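The prune-and-polarize idea above can be sketched in a few lines: derive one fixed sparse mask from attention maps averaged over a calibration set, then reorder tokens so heavily attended ("denser") columns are grouped apart from the rest. This is a hypothetical NumPy simplification of the paper's algorithm step, not ViTCoD's actual implementation; the function name, threshold rule, and column-density ranking are illustrative assumptions.

```python
import numpy as np

def polarize_attention_mask(attn_maps, sparsity=0.9):
    """Derive a fixed sparse attention mask and split tokens into denser
    vs. sparser groups. Toy stand-in for ViTCoD's prune-and-polarize step.

    attn_maps: (num_samples, T, T) attention scores from a calibration set.
    Returns (mask reordered by column density, column order).
    """
    avg = attn_maps.mean(axis=0)                      # (T, T) average map
    T = avg.shape[0]
    k = max(1, int(round(T * T * (1.0 - sparsity))))  # entries to keep
    thresh = np.sort(avg.ravel())[::-1][k - 1]        # k-th largest score
    mask = avg >= thresh                              # fixed sparse pattern
    # Polarize: rank columns by how often they are attended to, so dense
    # columns go to a dense engine and the long sparse tail to a sparse one.
    col_density = mask.sum(axis=0)
    order = np.argsort(col_density)[::-1]
    return mask[:, order], order
```

Because the mask is fixed offline rather than predicted per input, the accelerator can hard-wire the schedule for both workload levels, which is the property the summary contrasts against NLP accelerators.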
📝 Abstract
Vision Transformers (ViTs) have achieved state-of-the-art performance on various vision tasks. However, ViTs’ self-attention module is still arguably a major bottleneck, limiting their achievable hardware efficiency and more extensive applications to resource constrained platforms. Meanwhile, existing accelerators dedicated to NLP Transformers are not optimal for ViTs. This is because there is a large difference between ViTs and Transformers for natural language processing (NLP) tasks: ViTs have a relatively fixed number of input tokens, whose attention maps can be pruned by up to 90% even with fixed sparse patterns, without severely hurting the model accuracy (e.g., <=1.5% under 90% pruning ratio); while NLP Transformers need to handle input sequences of varying numbers of tokens and rely on on-the-fly predictions of dynamic sparse attention patterns for each input to achieve a decent sparsity (e.g., >=50%). To this end, we propose a dedicated algorithm and accelerator co-design framework dubbed ViTCoD for accelerating ViTs. Specifically, on the algorithm level, ViTCoD prunes and polarizes the attention maps to have either denser or sparser fixed patterns for regularizing two levels of workloads without hurting the accuracy, largely reducing the attention computations while leaving room for alleviating the remaining dominant data movements; on top of that, we further integrate a lightweight and learnable auto-encoder module to enable trading the dominant high-cost data movements for lower-cost computations. On the hardware level, we develop a dedicated accelerator to simultaneously coordinate the aforementioned enforced denser and sparser workloads for boosted hardware utilization, while integrating on-chip encoder and decoder engines to leverage ViTCoD’s algorithm pipeline for much reduced data movements. 
Extensive experiments and ablation studies validate that ViTCoD largely reduces the dominant data movement costs, achieving speedups of up to 235.3×, 142.9×, 86.0×, 10.1×, and 6.8× over general computing platforms CPUs, EdgeGPUs, GPUs, and prior-art Transformer accelerators SpAtten and Sanger under an attention sparsity of 90%, respectively. Our code implementation is available at https://github.com/GATECH-EIC/ViTCoD.
Problem

Research questions and friction points this paper is trying to address.

How to overcome the self-attention bottleneck that limits ViTs' hardware efficiency and their deployment on resource-constrained platforms.
How to co-design the algorithm and accelerator so that reduced attention computation translates into end-to-end speedup.
How to cut the dominant off-chip data-movement cost by exploiting the fixed sparse attention patterns that ViTs admit.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Algorithm–accelerator co-design framework tailored to ViT characteristics
Prunes and polarizes attention maps into fixed denser/sparser patterns for regularized workloads
Integrates a lightweight learnable auto-encoder to trade off-chip data movement for on-chip computation
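The auto-encoder contribution can be illustrated with a minimal linear sketch: compress activations before the (off-chip) transfer and reconstruct them on arrival, so fewer bytes move at the price of a small matrix multiply. PCA via SVD stands in here for the learned encoder/decoder pair; ViTCoD trains its module end to end, so the function name, compression ratio, and fitting procedure below are illustrative assumptions only.

```python
import numpy as np

def fit_linear_autoencoder(x, ratio=4):
    """Fit a toy linear auto-encoder on calibration activations x: (n, dim).

    encode() shrinks the last dimension by `ratio` (fewer bytes moved
    off-chip); decode() reconstructs on-chip. PCA is a stand-in for the
    learnable module described in the paper.
    """
    dim = x.shape[1]
    k = dim // ratio                     # compressed width
    mu = x.mean(axis=0)
    # Top-k right singular vectors give the best rank-k linear codec.
    _, _, vt = np.linalg.svd(x - mu, full_matrices=False)
    enc = vt[:k].T                       # (dim, k) projection
    encode = lambda a: (a - mu) @ enc    # run before the transfer
    decode = lambda z: z @ enc.T + mu    # run after the transfer
    return encode, decode
```

With `ratio=4`, the transferred tensor is a quarter of the original size, which mirrors the framework's trade of high-cost data movement for low-cost computation.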