🤖 AI Summary
This study investigates whether spike sparsity reduces latency and energy consumption in edge-GPU deployments. Using the Darcy rectangular benchmark on a Jetson Orin Nano 8 GB, it compares the deployed cost of variable-spiking wavelet neural operators (VS-WNO) against matched dense wavelet neural operators (WNO), and uses Nsight Systems to analyze runtime overhead. The experiments provide empirical evidence that, despite its algorithmic sparsity, VS-WNO has higher latency (59.6 ms) and dynamic energy per inference (228.0 mJ) than dense WNO (53.2 ms, 180.7 mJ) on a current commodity edge-GPU software stack. Profiling shows why: the request path is dominated by kernel launches and dense convolution kernels, so algorithmic sparsity does not translate into deployment benefits when the runtime is not sparsity-aware.
📝 Abstract
Spiking neural operators are appealing for neuromorphic edge computing because event-driven substrates can, in principle, translate sparse activity into lower latency and energy. Whether that advantage survives deployment on commodity edge-GPU software stacks, however, remains unclear. We study this question on a Jetson Orin Nano 8 GB using five pretrained variable-spiking wavelet neural operator (VS-WNO) checkpoints and five matched dense wavelet neural operator (WNO) checkpoints on the Darcy rectangular benchmark. On a reference-aligned path, VS-WNO exhibits substantial algorithmic sparsity, with mean spike rates decreasing from 54.26% at the first spiking layer to 18.15% at the fourth. On a deployment-style request path, however, this sparsity does not reduce deployed cost: VS-WNO reaches 59.6 ms latency and 228.0 mJ dynamic energy per inference, whereas dense WNO reaches 53.2 ms and 180.7 mJ. Dense WNO also achieves slightly lower reference-path error (1.77% versus 1.81% for VS-WNO). Nsight Systems indicates that the request path remains launch-dominated and dense rather than sparsity-aware: for VS-WNO, cudaLaunchKernel accounts for 81.6% of CUDA API time within the latency window, and dense convolution kernels account for 53.8% of GPU kernel time; dense WNO shows the same pattern. On this Jetson-class GPU stack, spike sparsity is therefore measurable but yields no saving, because the runtime does not suppress dense work as spike activity decreases.
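To put the reported gap in relative terms, the arithmetic can be sketched as below. This is an illustrative calculation only: the four input values are the per-inference figures quoted in the abstract, while the helper name `relative_overhead` and the variable names are assumptions introduced here, not part of the study's code.

```python
def relative_overhead(candidate: float, baseline: float) -> float:
    """Return how much larger `candidate` is than `baseline`, in percent."""
    return (candidate - baseline) / baseline * 100.0


# Per-inference figures quoted in the abstract (deployment-style request path).
vs_wno_latency_ms, wno_latency_ms = 59.6, 53.2
vs_wno_energy_mj, wno_energy_mj = 228.0, 180.7

# Hypothetical variable names; dense WNO is treated as the baseline.
latency_overhead = relative_overhead(vs_wno_latency_ms, wno_latency_ms)
energy_overhead = relative_overhead(vs_wno_energy_mj, wno_energy_mj)

print(f"VS-WNO latency overhead vs dense WNO: {latency_overhead:.1f}%")
print(f"VS-WNO dynamic-energy overhead vs dense WNO: {energy_overhead:.1f}%")
```

With dense WNO as the baseline, this puts VS-WNO at roughly 12% higher latency and 26% higher dynamic energy per inference, so the energy penalty is proportionally larger than the latency penalty.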