FTerViT: Fully Ternary Vision Transformer

📅 2026-05-20

🤖 AI Summary

This work proposes FTerViT, a fully ternarized Vision Transformer. Existing ternary ViTs quantize only the encoder layers and keep components such as Patch Embedding and LayerNorm in full precision, which hinders efficient deployment on resource-constrained devices. FTerViT ternarizes all weights and normalization parameters and introduces two novel operators, TernaryBitConv2d and TernaryLayerNorm, to enable end-to-end ternary computation. Trained with knowledge distillation followed by a quantization-aware recovery phase, the W2A8 model achieves approximately 15× compression and 82.43% top-1 accuracy on ImageNet-1K at 384×384 resolution. At 224×224 resolution, the model is deployed on an ESP32-S3 microcontroller and reaches 79.64% top-1 accuracy, demonstrating practical feasibility for edge applications.
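The compression figure can be sanity-checked with a little arithmetic. The sketch below assumes that W2 means 2 bits of storage per weight, that a DeiT-S-sized model has roughly 22 million parameters, and that 1 MB is 10^6 bytes; the helper name `packed_size_mb` is hypothetical and none of these values are taken from the paper.

```python
def packed_size_mb(num_weights: int, bits_per_weight: int) -> float:
    """Storage in MB (10^6 bytes) for weights packed at a fixed bit width."""
    return num_weights * bits_per_weight / 8 / 1e6


# Assumption: approximate parameter count of a DeiT-S-sized model.
num_weights = 22_000_000

fp32_mb = packed_size_mb(num_weights, 32)  # full-precision baseline
w2_mb = packed_size_mb(num_weights, 2)     # ternary codes stored in 2 bits
ideal_ratio = fp32_mb / w2_mb              # upper bound set by 32 bits / 2 bits

print(f"FP32: {fp32_mb:.2f} MB, W2: {w2_mb:.2f} MB, ratio: {ideal_ratio:.1f}x")
```

Under these assumptions the ideal ratio is 16×, so a reported figure of about 15× would be consistent with a small overhead for items such as scaling factors, although the summary does not give that breakdown.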

📝 Abstract

Ternary Vision Transformers offer substantial model compression; however, state-of-the-art methods ternarize only the encoder layers, leaving patch embeddings, LayerNorm parameters, and classifier heads in full precision. In compact models targeting resource-constrained processors such as microcontrollers, these remaining full-precision components determine the total memory footprint, severely limiting deployment efficiency and on-device feasibility. In this work, we introduce FTerViT, a fully ternarized Vision Transformer in which all weight matrices and normalization parameters are ternarized. To this end, we introduce two novel operators: TernaryBitConv2d, which uses per-channel scaling for patch embedding, and TernaryLayerNorm. FTerViT is trained using knowledge distillation, followed by a lightweight quantization-aware recovery phase. Our ternary W2A8 DeiT-III-S at 384×384 resolution achieves 82.43% ImageNet-1K top-1 accuracy at 6.09 MB (~15× compression, −2.42 pp vs. FP32), outperforming prior ternary ViT methods by up to 8 pp. Finally, we demonstrate the first implementation of a ternary Vision Transformer on the dual-core Xtensa LX7 microcontroller inside the ESP32-S3 system-on-chip. By deploying FTerViT-Small (based on DeiT-III-Small at 224×224 resolution, 5.81 MB), we achieve 79.64% ImageNet-1K top-1 accuracy.
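To make the idea of ternary weights with per-channel scaling concrete, here is a minimal NumPy sketch. It is not the paper's TernaryBitConv2d: the abstract does not specify the quantization rule, so the sketch assumes an absmean scale per output channel (the rule used in BitNet b1.58-style quantizers), and the function names `ternarize_per_channel` and `dequantize` are hypothetical.

```python
import numpy as np


def ternarize_per_channel(w: np.ndarray, eps: float = 1e-8):
    """Quantize a conv weight of shape (out_ch, in_ch, kh, kw) to {-1, 0, +1}.

    Returns int8 codes with the same shape as w and one float scale per
    output channel. The absmean scale is an assumption, not the paper's rule.
    """
    flat = w.reshape(w.shape[0], -1)
    # One scale per output channel: mean absolute value of that channel.
    scale = np.abs(flat).mean(axis=1, keepdims=True) + eps
    # Round to the nearest integer, then clamp to the ternary set.
    codes = np.clip(np.rint(flat / scale), -1, 1).astype(np.int8)
    return codes.reshape(w.shape), scale.reshape(-1)


def dequantize(codes: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Rebuild an approximate float weight from ternary codes and scales."""
    return codes.astype(np.float32) * scale.reshape(-1, 1, 1, 1)


# Example: a patch-embedding-like kernel (8 output channels, 16x16 patches).
rng = np.random.default_rng(0)
w = rng.normal(size=(8, 3, 16, 16)).astype(np.float32)
codes, scale = ternarize_per_channel(w)
w_hat = dequantize(codes, scale)
print(np.unique(codes), scale.shape, float(np.abs(w - w_hat).mean()))
```

Keeping one scale per output channel lets each filter of the patch embedding retain its own dynamic range while the weights themselves need only two bits of storage.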

Problem

Research questions and friction points this paper is trying to address.

Ternary Vision Transformer

model compression

memory footprint

resource-constrained devices

full-precision components

Innovation

Methods, ideas, or system contributions that make the work stand out.

Fully Ternary Vision Transformer

TernaryBitConv2d

TernaryLayerNorm