🤖 AI Summary
Deploying Vision Transformers (ViTs) on near-sensor silicon photonic accelerators is hampered by hardware-induced noise and tight energy budgets, which often cause significant accuracy degradation. This work presents the first ViT deployment framework tailored to real-world silicon photonic hardware, achieving hardware-algorithm co-optimization through several key innovations: empirical noise modeling based on measured microring-resonator arrays, an activation-dependent variance proxy model, chance-constrained training (CCT), and a noise-aware LayerNorm design. Notably, the proposed approach recovers near-ideal, noise-free accuracy on actual photonic hardware without requiring on-chip learning or additional optical components, while simultaneously adhering to stringent system energy budgets.
📝 Abstract
Deploying Vision Transformers (ViTs) on near-sensor analog accelerators demands training pipelines that are explicitly aligned with device-level noise and energy constraints. We introduce a compact framework for silicon-photonic execution of ViTs that integrates measured hardware noise, robust attention training, and an energy-aware processing flow. We first characterize bank-level noise in microring-resonator (MR) arrays, including fabrication variation, thermal drift, and amplitude noise, and convert these measurements into closed-form, activation-dependent variance proxies for attention logits and feed-forward activations. Using these proxies, we develop Chance-Constrained Training (CCT), which enforces variance-normalized logit margins to bound attention rank flips, and a noise-aware LayerNorm that stabilizes feature statistics without changing the optical schedule. These components yield a practical "measure → model → train → run" pipeline that optimizes accuracy under noise while respecting system energy limits. Hardware-in-the-loop experiments with MR photonic banks show that our approach restores near-clean accuracy under realistic noise budgets, with no in-situ learning or additional optical MACs.
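To make the CCT idea concrete, the sketch below shows one plausible form of a variance-normalized margin penalty on attention logits: if the gap between the top-1 and top-2 logits, divided by the standard deviation predicted by a variance proxy, exceeds a threshold, a noise-induced rank flip is unlikely. The function name, the hinge form, and the threshold `tau` are illustrative assumptions, not the paper's exact loss.

```python
import numpy as np

def cct_margin_penalty(logits, sigma2, tau=1.0):
    """Illustrative Chance-Constrained Training (CCT) penalty (assumed form).

    Penalizes rows of attention logits whose variance-normalized top-1/top-2
    margin falls below tau, so that additive noise with per-logit variance
    given by the proxy `sigma2` is unlikely to flip the attention ranking.

    logits: (..., n) attention logits for each query row
    sigma2: (..., n) activation-dependent variance proxy per logit
    """
    order = np.argsort(logits, axis=-1)
    i1 = order[..., -1:]                     # index of top-1 logit
    i2 = order[..., -2:-1]                   # index of top-2 logit
    margin = (np.take_along_axis(logits, i1, -1)
              - np.take_along_axis(logits, i2, -1))[..., 0]
    # variance of the margin under independent noise on the two logits
    var = (np.take_along_axis(sigma2, i1, -1)
           + np.take_along_axis(sigma2, i2, -1))[..., 0]
    norm_margin = margin / (np.sqrt(var) + 1e-8)
    # hinge penalty: nonzero only when the normalized margin is below tau
    return np.maximum(tau - norm_margin, 0.0).mean()
```

In training, a penalty of this shape would be added (with a weight) to the task loss, pushing the network to keep attention decisions robust to the measured hardware noise without any extra optical operations at inference.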