AI Summary
To address the computational intensity and low energy efficiency of Vision Transformers (ViTs) in edge vision applications, this work proposes the first near-sensor, region-aware ViT acceleration architecture integrated with silicon photonics. Methodologically: (i) a lightweight mask generation network dynamically prunes non-salient image regions; (ii) optical matrix multiplication is implemented using vertical-cavity surface-emitting lasers (VCSELs) and microring resonators, while electronic circuits handle nonlinearities and normalization; (iii) quantization-aware training, low-rank matrix decomposition, and region-of-interest identification are co-optimized under photonic hardware constraints. Experimental results demonstrate an energy efficiency of 100.4 KFPS/W across classification, object detection, and video understanding tasks, achieving an 84% reduction in energy consumption with less than 1.6% accuracy degradation. The architecture significantly enhances the real-time inference capability and scalability of ViTs at the edge.
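The region-pruning idea in (i) can be illustrated with a minimal sketch: a tiny scorer stands in for the paper's mask generation network, assigns a saliency score to each image patch, and drops the low-scoring half before any further encoding. All names, shapes, and the 50% pruning ratio here are illustrative assumptions, not the paper's actual MGNet.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the paper's mask generation network:
# a single linear scorer over flattened patches (shapes are assumptions).
num_patches, patch_dim = 16, 64
patches = rng.standard_normal((num_patches, patch_dim))
scorer_w = rng.standard_normal(patch_dim)

scores = patches @ scorer_w                   # one saliency score per patch
keep = scores >= np.quantile(scores, 0.5)     # prune the bottom 50% of patches
pruned = patches[keep]                        # only salient patches reach the ViT encoder

print(pruned.shape)  # (8, 64): half the patches survive
```

Because the pruning happens before encoding, every downstream attention and MLP block operates on fewer tokens, which is where the compute and energy savings come from.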
Abstract
Vision Transformers (ViTs) have emerged as a powerful architecture for computer vision tasks due to their ability to model long-range dependencies and global contextual relationships. However, their substantial compute and memory demands hinder efficient deployment in scenarios with strict energy and bandwidth limitations. In this work, we propose Opto-ViT, the first near-sensor, region-aware ViT accelerator leveraging silicon photonics (SiPh) for real-time and energy-efficient vision processing. Opto-ViT features a hybrid electronic-photonic architecture, where the optical core handles compute-intensive matrix multiplications using Vertical-Cavity Surface-Emitting Lasers (VCSELs) and Microring Resonators (MRs), while nonlinear functions and normalization are executed electronically. To reduce redundant computation and patch processing, we introduce a lightweight Mask Generation Network (MGNet) that identifies regions of interest in the current frame and prunes irrelevant patches before ViT encoding. We further co-optimize the ViT backbone using quantization-aware training and matrix decomposition tailored to photonic constraints. Experiments spanning device fabrication, circuit- and architecture-level co-design, and classification, detection, and video tasks demonstrate that Opto-ViT achieves 100.4 KFPS/W, with up to 84% energy savings and less than 1.6% accuracy loss, while enabling scalable and efficient ViT deployment at the edge.
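The "matrix decomposition tailored to photonic constraints" can be sketched with a generic truncated-SVD factorization: a weight matrix W is replaced by two thin factors U and V, so one large matmul becomes two small ones that fit a limited-size photonic core. The rank, dimensions, and use of plain SVD here are assumptions for illustration, not the paper's exact decomposition.

```python
import numpy as np

rng = np.random.default_rng(1)

# Build a weight matrix that is exactly rank 8, so a rank-8
# factorization reconstructs it to numerical precision.
d, rank = 64, 8
W = rng.standard_normal((d, rank)) @ rng.standard_normal((rank, d))

# Truncated SVD: W ≈ U @ V with thin factors.
U_full, s, Vt = np.linalg.svd(W, full_matrices=False)
U = U_full[:, :rank] * s[:rank]   # (d, rank)
V = Vt[:rank]                     # (rank, d)

x = rng.standard_normal(d)
# Two small matmuls, (x @ U) @ V, replace one large matmul x @ W:
# 2*d*rank multiplies per vector instead of d*d.
err = np.max(np.abs(x @ W - (x @ U) @ V))
print(err < 1e-9)  # True for this exactly low-rank W
```

For a genuinely low-rank (or well-approximated) weight, the factored form cuts the multiply count from d*d to 2*d*rank per input vector, which is what lets the compute-heavy matmuls map onto a constrained VCSEL/MR optical core.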