🤖 AI Summary
This work addresses the deployment challenges of Vision Transformers (ViTs) in resource-constrained semiconductor package defect inspection, where high computational cost, memory footprint, and energy consumption hinder practical adoption. To overcome these limitations, the authors propose one of the first end-to-end efficient ViT deployment frameworks that jointly optimizes three axes: architecture, token count, and bit-width. The approach uses AutoFormer-based neural architecture search to select a compact backbone, applies the Token Merging (ToMe) algorithm to fuse redundant tokens at inference time, and runs inference in fp16 mixed precision to speed up each operation. Compared to the DeiT-B/16 baseline, the method achieves more than a 10× throughput improvement while reducing parameters, FLOPs, and energy consumption by more than 90%, and it maintains the accuracy required for the industrial defect classification task. The combined gains exceed what the authors attribute to single-axis optimization strategies.
📝 Abstract
Vision Transformers (ViTs) have achieved strong performance in visual recognition, yet their deployment in resource-constrained industrial environments remains limited. The main obstacles are their high computational cost, memory footprint, and energy consumption. While individual efficiency techniques such as neural architecture search (NAS), token compression, and low-precision inference have been extensively studied, most prior work targets only a single optimization axis, which limits the deployment gains achievable without sacrificing accuracy. In this paper, we present one of the first holistic frameworks that jointly optimizes three complementary axes: architecture, token count, and bit-width. Specifically, the framework identifies compact backbones via neural architecture search (AutoFormer), reduces the number of tokens processed via token merging (ToMe), and accelerates per-operation execution via fp16 mixed-precision inference. Starting from a DeiT-B/16 baseline, we first analyze accuracy-efficiency trade-offs on ImageNet-1K under aggressive compression. Then, we apply the selected configurations to a real-world in-house 3D X-ray semiconductor defect classification dataset for IC chip packaging inspection. Results show that the proposed multi-axis framework achieves more than a 10× improvement in throughput along with more than 10× reductions in parameter count, FLOPs, and energy consumption, while maintaining the required accuracy on the downstream industrial task. To the best of our knowledge, this is among the earliest works to jointly optimize the architecture, token, and bit-width dimensions in ViTs and the first such resource-efficient, deployment-focused study tailored to semiconductor manufacturing.
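To make the token axis concrete, the following is a rough back-of-the-envelope sketch, not the authors' measurement code. It estimates encoder multiply-accumulates (MACs) for a standard ViT block, using the usual cost model of 12·N·D² for the linear layers plus 2·N²·D for attention, and shows how merging `r` tokens per layer shrinks that cost. The DeiT-B/16 shape (197 tokens, width 768, depth 12) is the standard one for 224×224 inputs. The value `r = 16`, the function names, and the choices to merge between attention and MLP and to cap merging at half of the non-class tokens are assumptions made for illustration. The abstract does not state the configuration the paper actually uses.

```python
def block_macs(n_attn, n_mlp, dim, mlp_ratio=4):
    """Approximate MACs for one ViT block.

    n_attn tokens go through self-attention and n_mlp tokens through the MLP;
    the two differ when tokens are merged between the two sub-layers.
    """
    qkv_and_proj = 4 * n_attn * dim * dim      # Q, K, V and output projections
    attention = 2 * n_attn * n_attn * dim      # QK^T scores and weighted sum of V
    mlp = 2 * mlp_ratio * n_mlp * dim * dim    # two linear layers, hidden = 4 * dim
    return qkv_and_proj + attention + mlp


def encoder_macs(tokens, dim, depth, r=0):
    """Return (total MACs, tokens entering each block) when merging r per layer.

    Assumption: at most half of the non-class tokens can be merged in a layer,
    since bipartite matching pairs each merged token with a partner.
    """
    total = 0
    schedule = []
    for _ in range(depth):
        schedule.append(tokens)
        merged = min(r, (tokens - 1) // 2)
        total += block_macs(tokens, tokens - merged, dim)
        tokens -= merged
    return total, schedule


if __name__ == "__main__":
    # DeiT-B/16 at 224x224: 196 patch tokens + 1 class token.
    baseline, _ = encoder_macs(tokens=197, dim=768, depth=12)
    merged, schedule = encoder_macs(tokens=197, dim=768, depth=12, r=16)

    print(f"baseline encoder: {baseline / 1e9:.2f} GMACs")
    print(f"with r=16:        {merged / 1e9:.2f} GMACs")
    print(f"tokens per block: {schedule}")
    print(f"reduction factor: {baseline / merged:.2f}x")
```

The baseline works out to about 17.4 GMACs for the encoder alone, excluding patch embedding and the classification head. Under this model token merging alone cannot reach a 10× FLOP reduction, because the number of tokens merged per layer is capped at half of the non-class tokens. That is consistent with the abstract's argument that the architecture and bit-width axes have to be optimized together with the token axis.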