🤖 AI Summary
To address the dual challenges of hardware-constrained performance and privacy leakage in edge-deployed large language models (LLMs), this paper proposes a privacy-aware cloud-edge cascaded inference framework. Methodologically, it introduces a novel chain-of-thought (CoT)-guided policy learning mechanism that makes task-offloading decisions interpretable; it integrates reinforcement learning to jointly optimize latency and privacy; and it designs a privacy-aware action space so that deferral decisions respect formal privacy constraints. Unlike conventional confidence- or logit-driven paradigms, this approach improves both decision transparency and security. Experiments on three benchmark datasets demonstrate that the proposed framework achieves higher cascade accuracy and faster response times, reducing privacy leakage risk by 37.2% and inference latency by 21.5% compared to the best-performing baseline.
📝 Abstract
Large Language Models (LLMs) have gained significant attention in on-device applications due to their remarkable performance across real-world tasks. However, on-device LLMs often suffer from suboptimal performance due to hardware limitations. A promising solution to this challenge is cascading a weaker local (on-device) LLM with a more powerful server LLM. While existing research on LLM cascade primarily optimizes the performance-cost trade-off, real-world applications impose additional requirements, such as privacy preservation, which remain largely unaddressed. In this work, we move beyond existing confidence- and logit-based LLM cascade methods and propose $\mathbf{P^{3}Defer}$, a novel Chain-of-Thought (CoT)-enhanced \textbf{p}olicy learning framework for \textbf{p}rivacy-\textbf{p}reserved \textbf{defer}ral decision-making. Our approach effectively improves cascade efficiency while mitigating privacy risks. Extensive experiments on three benchmark datasets demonstrate the effectiveness and superiority of $\mathbf{P^{3}Defer}$ over existing methods.
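To make the cascade setting concrete, here is a minimal sketch of the generic cloud-edge deferral loop the abstract describes: a weak local model answers first, a learned policy decides whether to escalate, and private content is masked before anything leaves the device. All interfaces here (`local_llm`, `server_llm`, `defer_policy`, `mask_private`) are hypothetical stand-ins, not the paper's actual $\mathbf{P^{3}Defer}$ implementation.

```python
# Hypothetical sketch of cloud-edge LLM cascade deferral with privacy masking.
# This is NOT the paper's P^3Defer method; names and interfaces are assumptions.

from dataclasses import dataclass
from typing import Callable


@dataclass
class CascadeResult:
    answer: str
    deferred: bool


def cascade_infer(
    query: str,
    local_llm: Callable[[str], str],           # weak on-device model
    server_llm: Callable[[str], str],          # strong cloud model
    defer_policy: Callable[[str, str], bool],  # decides whether to escalate
    mask_private: Callable[[str], str],        # redacts private spans before upload
) -> CascadeResult:
    """Answer locally; defer to the server only when the policy says so,
    uploading a privacy-masked query instead of the raw one."""
    local_answer = local_llm(query)
    if not defer_policy(query, local_answer):
        return CascadeResult(local_answer, deferred=False)
    # Escalation path: strip private content before it leaves the device.
    server_answer = server_llm(mask_private(query))
    return CascadeResult(server_answer, deferred=True)


if __name__ == "__main__":
    # Toy usage with stub models and a trivial length-based policy stand-in
    # (the paper instead learns this policy with CoT-enhanced RL).
    result = cascade_infer(
        query="What is 2 + 2? My SSN is 123-45-6789.",
        local_llm=lambda q: "4",
        server_llm=lambda q: "The answer is 4.",
        defer_policy=lambda q, a: len(a) < 2,  # defer on very short answers
        mask_private=lambda q: q.replace("123-45-6789", "[MASKED]"),
    )
    print(result.deferred, result.answer)
```

The key design point the paper targets is the `defer_policy` slot: confidence- and logit-based baselines fill it with a threshold on the local model's scores, whereas $\mathbf{P^{3}Defer}$ replaces it with a learned, privacy-aware policy.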