Token Level Routing Inference System for Edge Devices

📅 2025-04-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Deploying large language models (LLMs) on resource-constrained edge devices faces fundamental challenges in computational efficiency and the quality–latency trade-off. This paper proposes a fine-grained collaborative decoding framework featuring a novel token-level dynamic routing mechanism: a lightweight on-device model performs primary inference, while only low-confidence critical tokens are selectively offloaded to a cloud-based LLM for auxiliary generation. The method integrates confidence-aware importance scoring, low-latency API scheduling, edge-cloud cooperative caching, and an asynchronous decoding pipeline—collectively minimizing communication overhead and computational cost. Evaluated on an M1 MacBook using a 0.5B-parameter model, the approach achieves a 60% accuracy gain on CommonsenseQA, with fewer than 7% of tokens uploaded to the cloud. This demonstrates a new paradigm for lightweight, efficient, and high-quality edge LLM inference.
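The core routing idea in the summary — accept the on-device model's token when it is confident, and offload only low-confidence steps to the cloud LLM — can be sketched as below. This is a minimal illustration, not the paper's implementation: the threshold value, function names, and the simple max-probability confidence score are assumptions.

```python
CONF_THRESHOLD = 0.7  # hypothetical confidence cutoff; the paper's scoring is more elaborate

def route_token(small_model_probs, cloud_generate, context):
    """Token-level routing step.

    small_model_probs: token -> probability from the on-device model.
    cloud_generate: callable that asks the cloud LLM for the next token.
    Returns (token, offloaded) where offloaded marks a cloud round-trip.
    """
    # Confidence here is simply the top-1 probability of the small model.
    best_token = max(small_model_probs, key=small_model_probs.get)
    confidence = small_model_probs[best_token]
    if confidence >= CONF_THRESHOLD:
        return best_token, False   # generated locally, no upload
    return cloud_generate(context), True  # critical token, offload to cloud

# Toy demonstration with stubbed model outputs:
confident = {"Paris": 0.92, "Lyon": 0.08}
uncertain = {"Paris": 0.40, "Lyon": 0.35, "Nice": 0.25}
print(route_token(confident, lambda ctx: "[cloud]", "ctx"))  # handled on-device
print(route_token(uncertain, lambda ctx: "[cloud]", "ctx"))  # offloaded
```

In a full decoding loop this check runs once per generated token, which is how the fraction of uploaded tokens (under 7% in the paper's evaluation) stays small.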

📝 Abstract
The computational complexity of large language model (LLM) inference significantly constrains their deployment efficiency on edge devices. In contrast, small language models offer faster decoding and lower resource consumption but often suffer from degraded response quality and heightened susceptibility to hallucinations. To address this trade-off, collaborative decoding, in which a large model assists in generating critical tokens, has emerged as a promising solution. This paradigm leverages the strengths of both model types by enabling high-quality inference through selective intervention of the large model, while maintaining the speed and efficiency of the smaller model. In this work, we present a novel collaborative decoding inference system that allows small models to perform on-device inference while selectively consulting a cloud-based large model for critical token generation. Remarkably, the system achieves a 60% performance gain on CommonsenseQA using only a 0.5B model on an M1 MacBook, with under 7% of generated tokens uploaded to the large model in the cloud.
Problem

Research questions and friction points this paper is trying to address.

Balancing LLM inference efficiency on edge devices
Improving small model quality without sacrificing speed
Enabling selective cloud assistance for critical tokens
Innovation

Methods, ideas, or system contributions that make the work stand out.

Token-level routing for edge devices
Collaborative decoding with cloud-based LLM
Selective critical token generation
Jianshu She
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
Wenhao Zheng
University of North Carolina at Chapel Hill
Zhengzhong Liu
Institute of Foundation Models
Natural Language Processing, Machine Learning
Hongyi Wang
Computer Science Department at Rutgers University
Eric P. Xing
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
Huaxiu Yao
Assistant Professor of Computer Science and Data Science, UNC Chapel Hill
Machine Learning, Foundation Models, AI Alignment, AI Agent, Robot Learning
Qirong Ho
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) and Petuum, Inc