Energy-Efficient Wireless LLM Inference via Uncertainty and Importance-Aware Speculative Decoding

📅 2025-08-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address communication and energy bottlenecks in large language model (LLM) inference on resource-constrained edge devices, this paper proposes a cloud-edge collaborative inference framework that jointly perceives token-level uncertainty and importance. The method integrates epistemic uncertainty estimation with attention-weight analysis at the token level to dynamically identify and upload high-information tokens to the cloud, while leveraging speculative decoding for lightweight local inference. Its key innovation lies in jointly applying uncertainty modeling and attention-based importance scoring for adaptive token-level filtering, significantly reducing wireless transmission overhead. Experiments show that, compared to standard hierarchical LLM (HLM) inference, the approach reduces energy consumption by 40.7% while achieving a BERTScore of up to 87.5% and a throughput of 0.37 tokens/sec; relative to the uncertainty-aware HLM (U-HLM) baseline, it improves BERTScore from 85.8% to 87.0%, energy savings from 31.6% to 43.6%, and throughput from 0.36 to 0.40 tokens/sec, all without compromising generation quality.
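The summary above describes a token-level filter that uploads a token to the cloud only when the local model is both uncertain about it and it carries high attention-based importance. A minimal sketch of that decision rule is below; the paper's actual uncertainty estimator, normalization, and thresholds are not given here, so the entropy-based uncertainty proxy, the `u_thresh`/`i_thresh` parameters, and the function name are illustrative assumptions.

```python
import numpy as np

def token_filter(token_probs, attention_weights, u_thresh=0.5, i_thresh=0.5):
    """Decide which locally drafted tokens to upload for cloud verification.

    token_probs:       (n_tokens, vocab) local-model next-token distributions
    attention_weights: (n_tokens,) attention mass each token receives
    Returns a boolean mask: True = upload the token to the cloud LLM.
    """
    # Uncertainty proxy: predictive entropy, normalized to [0, 1]
    # (a stand-in for the paper's epistemic uncertainty estimate).
    entropy = -np.sum(token_probs * np.log(token_probs + 1e-12), axis=1)
    uncertainty = entropy / np.log(token_probs.shape[1])

    # Importance proxy: attention weight, normalized to [0, 1].
    importance = attention_weights / (attention_weights.max() + 1e-12)

    # Upload only tokens that are both uncertain and important;
    # everything else is accepted from the local speculative draft.
    return (uncertainty > u_thresh) & (importance > i_thresh)
```

Under this rule, a confidently predicted or low-importance token never incurs a wireless upload, which is where the communication and energy savings come from.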

📝 Abstract
To address the growing demand for on-device LLM inference in resource-constrained environments, hybrid language models (HLM) have emerged, combining lightweight local models with powerful cloud-based LLMs. Recent studies on HLM have primarily focused on improving accuracy and latency, while often overlooking communication and energy efficiency. We propose a token-level filtering mechanism for energy-efficient, importance- and uncertainty-aware HLM inference that leverages both epistemic uncertainty and attention-based importance. Our method opportunistically uploads only informative tokens, reducing LLM usage and communication costs. Experiments with TinyLlama-1.1B and LLaMA-2-7B demonstrate that our method achieves up to 87.5% BERTScore and a token throughput of 0.37 tokens/sec while reducing energy consumption by 40.7% compared to standard HLM. Furthermore, compared to our previous U-HLM baseline, our method improves BERTScore from 85.8% to 87.0%, energy savings from 31.6% to 43.6%, and throughput from 0.36 to 0.40 tokens/sec. This approach enables energy-efficient and accurate deployment of LLMs in bandwidth-constrained edge environments.
Problem

Research questions and friction points this paper is trying to address.

Reducing energy consumption in wireless LLM inference
Optimizing token transmission for bandwidth-constrained edge environments
Balancing accuracy and efficiency in hybrid language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Token-level filtering for energy-efficient HLM inference
Leverages epistemic uncertainty and attention-based importance
Reduces communication costs and energy usage