Opto-ViT: Architecting a Near-Sensor Region of Interest-Aware Vision Transformer Accelerator with Silicon Photonics

📅 2025-07-09
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
To address the computational intensity and low energy efficiency of Vision Transformers (ViTs) in edge vision applications, this work proposes the first near-sensor, region-aware ViT acceleration architecture integrated with silicon photonics. Methodologically: (i) a lightweight mask generation network dynamically prunes non-salient image regions; (ii) optical matrix multiplication is implemented using vertical-cavity surface-emitting lasers (VCSELs) and microring resonators, while electronic circuits handle nonlinearities and normalization; (iii) quantization-aware training, low-rank matrix decomposition, and region-of-interest identification are co-optimized under photonic hardware constraints. Experimental results demonstrate an energy efficiency of 100.4 KFPS/W across classification, object detection, and video understanding tasksโ€”achieving an 84% reduction in energy consumption with less than 1.6% accuracy degradation. The architecture significantly enhances real-time inference capability and scalability of ViTs at the edge.
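The patch pruning in step (i) can be made concrete with a small sketch: a mask network scores each image patch, and only salient patches are forwarded to the ViT encoder. The helper name, the score shape, and the threshold below are assumptions for illustration, not the paper's actual MGNet implementation:

```python
import numpy as np

def prune_patches(patches, saliency_scores, keep_threshold=0.5):
    """Keep only patches whose saliency score exceeds the threshold.

    patches:         (N, D) array of flattened image patches
    saliency_scores: (N,)   per-patch scores from a mask network (assumed in [0, 1])

    Returns the retained patches and their original indices, so positional
    embeddings can still be gathered correctly downstream.
    """
    keep = saliency_scores >= keep_threshold
    return patches[keep], np.flatnonzero(keep)

# Toy example: 4 patches of dimension 3; two salient patches survive.
patches = np.arange(12, dtype=np.float32).reshape(4, 3)
scores = np.array([0.9, 0.1, 0.7, 0.2])
kept, idx = prune_patches(patches, scores)
```

Because attention cost grows quadratically with the number of patches, dropping non-salient patches before encoding reduces both the photonic matrix-multiply workload and the data moved off the sensor.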

๐Ÿ“ Abstract
Vision Transformers (ViTs) have emerged as a powerful architecture for computer vision tasks due to their ability to model long-range dependencies and global contextual relationships. However, their substantial compute and memory demands hinder efficient deployment in scenarios with strict energy and bandwidth limitations. In this work, we propose Opto-ViT, the first near-sensor, region-aware ViT accelerator leveraging silicon photonics (SiPh) for real-time and energy-efficient vision processing. Opto-ViT features a hybrid electronic-photonic architecture, in which the optical core handles compute-intensive matrix multiplications using Vertical-Cavity Surface-Emitting Lasers (VCSELs) and Microring Resonators (MRs), while nonlinear functions and normalization are executed electronically. To reduce redundant computation and patch processing, we introduce a lightweight Mask Generation Network (MGNet) that identifies regions of interest in the current frame and prunes irrelevant patches before ViT encoding. We further co-optimize the ViT backbone using quantization-aware training and matrix decomposition tailored to photonic constraints. Experiments spanning device fabrication and circuit-architecture co-design through classification, detection, and video tasks demonstrate that Opto-ViT achieves 100.4 KFPS/W, with up to 84% energy savings and less than 1.6% accuracy loss, while enabling scalable and efficient ViT deployment at the edge.
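The matrix decomposition mentioned for the photonic mapping can be sketched with a truncated SVD: a large ViT weight matrix is factored into two smaller matrices so each factor fits a fixed-size photonic crossbar. The function name, the factor shapes, and the choice of rank below are assumptions for illustration; the summary does not specify the paper's exact decomposition:

```python
import numpy as np

def low_rank_factor(W, rank):
    """Approximate W (m x n) as U @ V with U: (m, rank) and V: (rank, n)
    via truncated SVD. Each smaller factor can then be mapped onto a
    fixed-size photonic crossbar (the crossbar size is an assumption)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    # Fold the singular values into the left factor.
    return U[:, :rank] * s[:rank], Vt[:rank, :]

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
U, V = low_rank_factor(W, rank=16)
approx = U @ V  # rank-16 approximation of W
```

Truncated SVD gives the best rank-`k` approximation in the Frobenius norm, so the rank becomes a single knob trading accuracy against photonic array size.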
Problem

Research questions and friction points this paper is trying to address.

Reducing energy and bandwidth demands for Vision Transformers
Enabling real-time vision processing with silicon photonics
Optimizing ViT for edge deployment with minimal accuracy loss
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid electronic-photonic architecture for ViT acceleration
Lightweight MGNet for region-of-interest detection
Quantization-aware training and matrix decomposition tailored to photonic constraints