AI Summary
To address the computational intensity and low energy efficiency of Vision Transformers (ViTs) in edge vision applications, this work proposes the first near-sensor, region-aware ViT acceleration architecture integrated with silicon photonics. Methodologically: (i) a lightweight mask generation network dynamically prunes non-salient image regions; (ii) optical matrix multiplication is implemented using vertical-cavity surface-emitting lasers (VCSELs) and microring resonators, while electronic circuits handle nonlinearities and normalization; (iii) quantization-aware training, low-rank matrix decomposition, and region-of-interest identification are co-optimized under photonic hardware constraints. Experimental results demonstrate an energy efficiency of 100.4 KFPS/W across classification, object detection, and video understanding tasks, achieving an 84% reduction in energy consumption with less than 1.6% accuracy degradation. The architecture significantly enhances the real-time inference capability and scalability of ViTs at the edge.
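The region-pruning idea in (i) can be illustrated with a minimal sketch: a tiny scorer stands in for the paper's mask generation network, assigns a saliency score to each image patch, and drops the low-scoring half before any further encoding. All names, shapes, and the 50% pruning ratio here are illustrative assumptions, not the paper's actual MGNet.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the paper's mask generation network:
# a single linear scorer over flattened patches (shapes are assumptions).
num_patches, patch_dim = 16, 64
patches = rng.standard_normal((num_patches, patch_dim))
scorer_w = rng.standard_normal(patch_dim)

scores = patches @ scorer_w                   # one saliency score per patch
keep = scores >= np.quantile(scores, 0.5)     # prune the bottom 50% of patches
pruned = patches[keep]                        # only salient patches reach the ViT encoder

print(pruned.shape)  # (8, 64): half the patches survive
```

Because the pruning happens before encoding, every downstream attention and MLP block operates on fewer tokens, which is where the compute and energy savings come from.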
Abstract
Vision Transformers (ViTs) have emerged as a powerful architecture for computer vision tasks due to their ability to model long-range dependencies and global contextual relationships. However, their substantial compute and memory demands hinder efficient deployment in scenarios with strict energy and bandwidth limitations. In this work, we propose Opto-ViT, the first near-sensor, region-aware ViT accelerator leveraging silicon photonics (SiPh) for real-time and energy-efficient vision processing. Opto-ViT features a hybrid electronic-photonic architecture, where the optical core handles compute-intensive matrix multiplications using Vertical-Cavity Surface-Emitting Lasers (VCSELs) and Microring Resonators (MRs), while nonlinear functions and normalization are executed electronically. To reduce redundant computation and patch processing, we introduce a lightweight Mask Generation Network (MGNet) that identifies regions of interest in the current frame and prunes irrelevant patches before ViT encoding. We further co-optimize the ViT backbone using quantization-aware training and matrix decomposition tailored to photonic constraints. Experiments spanning device fabrication, circuit- and architecture-level co-design, and classification, detection, and video tasks demonstrate that Opto-ViT achieves 100.4 KFPS/W, with up to 84% energy savings and less than 1.6% accuracy loss, while enabling scalable and efficient ViT deployment at the edge.
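The "matrix decomposition tailored to photonic constraints" can be sketched with a generic truncated-SVD factorization: a weight matrix W is replaced by two thin factors U and V, so one large matmul becomes two small ones that fit a limited-size photonic core. The rank, dimensions, and use of plain SVD here are assumptions for illustration, not the paper's exact decomposition.

```python
import numpy as np

rng = np.random.default_rng(1)

# Build a weight matrix that is exactly rank 8, so a rank-8
# factorization reconstructs it to numerical precision.
d, rank = 64, 8
W = rng.standard_normal((d, rank)) @ rng.standard_normal((rank, d))

# Truncated SVD: W ≈ U @ V with thin factors.
U_full, s, Vt = np.linalg.svd(W, full_matrices=False)
U = U_full[:, :rank] * s[:rank]   # (d, rank)
V = Vt[:rank]                     # (rank, d)

x = rng.standard_normal(d)
# Two small matmuls, (x @ U) @ V, replace one large matmul x @ W:
# 2*d*rank multiplies per vector instead of d*d.
err = np.max(np.abs(x @ W - (x @ U) @ V))
print(err < 1e-9)  # True for this exactly low-rank W
```

For a genuinely low-rank (or well-approximated) weight, the factored form cuts the multiply count from d*d to 2*d*rank per input vector, which is what lets the compute-heavy matmuls map onto a constrained VCSEL/MR optical core.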