GELATO: Generative Entropy- and Lyapunov-based Adaptive Token Offloading for Device-Edge Speculative LLM Inference

2026-05-11
Citations: 0
Influential: 0

AI Summary
This work addresses the challenge of balancing energy efficiency and throughput in speculative decoding under resource-constrained edge environments. The authors propose GELATO, a framework that combines Lyapunov optimization with a generation-entropy-driven early-exit mechanism. An outer-loop Lyapunov controller sets a draft-generation budget by managing the long-term trade-off between energy consumption and throughput, while an inner loop triggers token-level early exits based on real-time generation entropy, enabling fine-grained, energy-aware adaptive offloading. Compared with state-of-the-art distributed speculative decoding approaches, GELATO is reported to achieve 64.98% higher token throughput and 47.47% lower energy consumption while preserving decoding quality, and its theoretical analysis provides a rigorous bound on long-term throughput.
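The inner loop described above can be illustrated with a minimal sketch. It assumes that drafting stops when the Shannon entropy of the draft model's next-token distribution exceeds a fixed threshold; the summary does not state the exact exit rule, so the direction of the test, the threshold, greedy token selection, and all names here (`draft_with_early_exit`, `toy_draft_model`) are hypothetical and not taken from the paper.

```python
import math
from typing import Callable, List, Sequence


def shannon_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def draft_with_early_exit(
    next_token_probs: Callable[[List[int]], Sequence[float]],
    prefix: List[int],
    budget: int,
    entropy_threshold: float,
) -> List[int]:
    """Draft at most `budget` tokens, exiting early on high uncertainty.

    Assumption: high entropy means the draft token is likely to be rejected,
    so drafting stops and the sequence is handed to the target model.
    """
    drafted: List[int] = []
    for _ in range(budget):
        probs = next_token_probs(prefix + drafted)
        if shannon_entropy(probs) > entropy_threshold:
            break  # early exit: offload verification to the target model
        # Greedy pick of the most likely token (an illustrative choice).
        drafted.append(max(range(len(probs)), key=probs.__getitem__))
    return drafted


def toy_draft_model(context: List[int]) -> Sequence[float]:
    """Hypothetical draft model: confident on short contexts, uniform afterwards."""
    if len(context) < 3:
        return [0.97, 0.01, 0.01, 0.01]
    return [0.25, 0.25, 0.25, 0.25]


tokens = draft_with_early_exit(toy_draft_model, prefix=[5], budget=8, entropy_threshold=0.5)
```

With this toy model, drafting ends once the context reaches three tokens, well before the budget of eight is used, which is the kind of per-token adaptation the summary attributes to the inner loop.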
Abstract
The recent growth of on-device Large Language Model (LLM) inference has driven significant interest in device-edge collaborative LLM inference. As a promising architecture, Speculative Decoding (SD) is increasingly adopted: a lightweight draft model rapidly generates candidate tokens that are then verified by a powerful target model. However, a fundamental challenge lies in achieving per-token resource scheduling that effectively adapts the SD paradigm to resource-constrained edge environments. This paper proposes a Generative Entropy- and Lyapunov-based Adaptive Token Offloading framework, named GELATO, to maximize decoding throughput under energy constraints in a device-edge collaborative SD system. Specifically, an outer drift-plus-penalty loop makes online decisions to establish a reference drafting budget, managing the long-term energy-throughput trade-off. Further, a nested entropy-driven generation mechanism performs early exiting to adapt to the generative uncertainty of each token. Theoretical analysis establishes a rigorous performance bound on long-term throughput for GELATO. Extensive evaluations demonstrate that GELATO achieves a globally optimal trade-off, outperforming state-of-the-art distributed SD architectures by 64.98% in token throughput and reducing energy consumption by 47.47% in resource-constrained environments, while preserving LLM decoding quality.
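As a rough illustration of the outer loop, the sketch below uses the generic drift-plus-penalty recipe from Lyapunov optimization: a virtual queue tracks how far energy use has exceeded its average budget, and each slot picks the drafting budget that minimizes queue-weighted energy minus V-weighted throughput. The abstract gives no formulas, so the energy and throughput models, the parameter `v`, and every function name below are assumptions made for illustration, not the authors' formulation.

```python
from typing import Callable, Iterable


def choose_budget(
    queue: float,
    candidates: Iterable[int],
    energy: Callable[[int], float],
    throughput: Callable[[int], float],
    v: float,
) -> int:
    """Pick the drafting budget k minimizing queue * energy(k) - v * throughput(k)."""
    return min(candidates, key=lambda k: queue * energy(k) - v * throughput(k))


def update_queue(queue: float, energy_used: float, energy_budget: float) -> float:
    """Virtual energy-deficit queue: grows when a slot overspends its budget."""
    return max(queue + energy_used - energy_budget, 0.0)


def toy_energy(k: int) -> float:
    """Hypothetical energy model: one unit of energy per drafted token."""
    return 1.0 * k


def toy_throughput(k: int, alpha: float = 0.8) -> float:
    """Hypothetical throughput model with diminishing returns per extra draft token."""
    return sum(alpha ** i for i in range(1, k + 1))


# Simulate a few slots: the chosen budget shrinks as the energy deficit builds up.
queue = 0.0
history = []
for _ in range(5):
    k = choose_budget(queue, range(1, 9), toy_energy, toy_throughput, v=1.0)
    history.append(k)
    queue = update_queue(queue, toy_energy(k), energy_budget=3.0)
```

A larger `v` weights throughput more heavily and tolerates a larger energy backlog, while a smaller `v` keeps energy use closer to the budget. This is the standard knob in drift-plus-penalty designs and is presumably the long-term trade-off the abstract refers to.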
Problem

Research questions and friction points this paper is trying to address.

Speculative Decoding
Device-Edge Collaboration
Token Offloading
Resource Scheduling
Energy Constraint
Innovation

Methods, ideas, or system contributions that make the work stand out.

Speculative Decoding
Adaptive Token Offloading
Generative Entropy
Lyapunov Optimization
Device-Edge Collaboration