🤖 AI Summary
Early-exit (EE) deep neural networks reduce inference latency but suffer from overconfidence, causing premature exits on hard instances and undermining prediction reliability. This paper proposes SPEED, a framework that integrates selective prediction into early-exit inference via per-layer deferral classifiers. At each intermediate layer, the exit decision is determined jointly by the confidence score and an instance-difficulty estimate: only easy instances trigger early exits, while instances flagged as hard are deferred to an expert instead of consuming further computation. This mechanism improves both reliability and efficiency without compromising accuracy: SPEED achieves a 2.05× speedup over full-stack inference and halves the risk of erroneous predictions. Its core innovation is a difficulty-aware exit policy that departs from conventional EE methods relying solely on fixed confidence thresholds, supporting trustworthy AI deployment at the edge.
📝 Abstract
Inference latency and trustworthiness of Deep Neural Networks (DNNs) are bottlenecks in deploying them in critical, sensitive applications. Early Exit (EE) DNNs address the latency issue by allowing samples to exit from intermediate layers once they attain a 'high' confidence score on the predicted class. However, DNNs are known to exhibit overconfidence, which can cause many samples to exit early and render EE strategies untrustworthy. We use Selective Prediction (SP) to overcome this issue by checking the 'hardness' of samples rather than relying on the confidence score alone. We propose SPEED, a novel approach that uses Deferral Classifiers (DCs) at each layer to check the hardness of samples before performing EEs. Specifically, the DCs identify whether a sample is hard to predict at an intermediate layer, which could lead to hallucination, and defer it to an expert. Early detection of hard samples prevents wasted computation and improves trust by deferring those samples to the expert. We demonstrate that EE aided by SP improves both accuracy and latency: our method reduces the risk of wrong predictions by $50\%$ with a speedup of $2.05\times$ compared to final-layer inference. The anonymized source code is available at https://github.com/Div290/SPEED
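To make the exit policy concrete, here is a minimal, dependency-free sketch of difficulty-aware early exiting as the abstract describes it: at each layer a deferral classifier first checks hardness (hard samples go to the expert), and only then is the confidence threshold consulted for an early exit. The function names, thresholds, and callable interfaces (`exit_heads`, `deferral_classifiers`, `final_head`) are illustrative assumptions, not the authors' implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def speed_inference(x, exit_heads, deferral_classifiers, final_head,
                    conf_threshold=0.9, hardness_threshold=0.5):
    """Sketch of a SPEED-style exit policy (thresholds are assumptions).

    exit_heads[i](x)           -> class logits at intermediate layer i
    deferral_classifiers[i](x) -> estimated hardness score in [0, 1]
    final_head(x)              -> prediction of the full-depth model
    """
    for exit_head, dc in zip(exit_heads, deferral_classifiers):
        # Deferral Classifier: hard samples are handed to the expert
        # immediately, before any more layers are computed.
        if dc(x) > hardness_threshold:
            return None, "defer"
        # Otherwise fall back to the usual confidence-based early exit.
        probs = softmax(exit_head(x))
        if max(probs) >= conf_threshold:
            return probs.index(max(probs)), "early_exit"
    # No layer was confident enough: run to the final layer.
    return final_head(x), "final_layer"
```

With toy callables, an easy sample exits early, a hard one is deferred, and a low-confidence easy sample falls through to the final layer; the real DCs and exit heads would be small learned modules attached to the backbone.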