🤖 AI Summary
Deep learning models on edge devices must deliver real-time inference while staying energy efficient. To address this, the paper proposes a hardware-aware, fine-grained, dynamically configurable early-exit mechanism. At runtime, the mechanism selects the exit layer that best fits instantaneous resource conditions (e.g., latency, power consumption), so the model can be compressed and scaled in real time without being redeployed. Unlike static models or monolithic dynamic inference approaches, our method introduces a multi-exit network architecture that combines gradient-sensitivity-driven exit placement, lightweight gating controllers, and an online resource-feedback scheduling algorithm. This establishes the first dynamic configuration paradigm that jointly optimizes inference latency, accuracy, and energy consumption. Evaluated on ImageNet, our approach achieves up to 3.2× speedup and 58% energy reduction with accuracy degradation under 0.8%, while supporting millisecond-level exit-policy switching.
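The runtime exit selection described above can be illustrated with a minimal sketch. This is not the paper's scheduling algorithm, which the summary does not specify; it is a simple greedy rule under stated assumptions: each exit has a pre-measured latency and energy estimate, deeper exits are more accurate, and the scheduler picks the deepest exit that fits the current budgets. The names `ExitProfile`, `select_exit`, and `PROFILES`, and all numbers, are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ExitProfile:
    """Hypothetical per-exit profile; all values are illustrative placeholders."""
    layer: int          # index of the layer the exit head is attached to
    latency_ms: float   # estimated latency when inference stops at this exit
    energy_mj: float    # estimated energy when inference stops at this exit


def select_exit(profiles: Sequence[ExitProfile],
                latency_budget_ms: float,
                energy_budget_mj: float) -> ExitProfile:
    """Return the deepest exit whose estimates fit both budgets.

    Assumes deeper exits are more accurate. If no exit fits, fall back to
    the shallowest one so that inference still produces a prediction.
    """
    if not profiles:
        raise ValueError("at least one exit profile is required")
    ordered = sorted(profiles, key=lambda p: p.layer)
    feasible = [p for p in ordered
                if p.latency_ms <= latency_budget_ms
                and p.energy_mj <= energy_budget_mj]
    return feasible[-1] if feasible else ordered[0]


# Illustrative profiles for a three-exit network (made-up numbers).
PROFILES = [
    ExitProfile(layer=4, latency_ms=3.0, energy_mj=10.0),
    ExitProfile(layer=8, latency_ms=6.0, energy_mj=22.0),
    ExitProfile(layer=12, latency_ms=10.0, energy_mj=40.0),
]
```

Calling `select_exit(PROFILES, latency_budget_ms=7.0, energy_budget_mj=30.0)` returns the layer-8 exit, since the layer-12 exit exceeds both budgets. Because the rule only scans a small table of exits, re-evaluating it whenever the budgets change is cheap, which is one plausible way to support frequent policy switching.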