Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling

📅 2024-08-16
🏛️ arXiv.org
📈 Citations: 4
Influential: 1
🤖 AI Summary
To address high inference latency in large language models (LLMs) and the limitations of existing speculative decoding methods (reliance on auxiliary architectures or external retrieval, substantial storage overhead, and poor adaptability), this paper proposes Token Recycling, a training-free acceleration technique. Its core innovation is exploiting the intrinsic token repetition that emerges during autoregressive decoding: it dynamically maintains an adjacency matrix of previously generated candidate tokens, builds a draft tree from it with a BFS-style algorithm, validates the candidates efficiently via tree attention, and updates the matrix online. Token Recycling requires no additional training, external corpora, or parameter fine-tuning, consumes less than 2 MB of memory, and achieves roughly 2× inference speedup across LLMs of diverse scales. It improves throughput by 30% over prior training-free methods and outperforms a leading training-based approach by 25%. The method stands out for its minimal resource footprint, broad model compatibility, and plug-and-play deployment.
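The summary's two core data structures can be sketched in a few lines. The following is a minimal toy illustration (not the authors' implementation; the vocabulary size, the cap of K successors per token, and all function names are assumptions): an adjacency table maps each token id to its most recently seen candidate successors, and a BFS-like expansion from the last accepted token yields a small draft tree of (token, parent-index) pairs.

```python
# Hypothetical sketch of Token Recycling's draft construction.
# Names (update, build_draft_tree) and sizes (VOCAB, K) are illustrative.
from collections import deque

VOCAB, K = 1000, 4  # assumed toy vocabulary size and successors kept per token

# adjacency[t] holds up to K candidate successors previously observed after token t
adjacency = [[] for _ in range(VOCAB)]

def update(token, candidates):
    """Online cache update: record the newest candidate successors for `token`."""
    adjacency[token] = list(candidates)[:K]

def build_draft_tree(root, max_nodes=8):
    """BFS-style expansion from `root`; returns (token, parent_index) pairs."""
    tree = [(root, -1)]
    queue = deque([0])  # indices into `tree` awaiting expansion
    while queue and len(tree) < max_nodes:
        idx = queue.popleft()
        tok, _ = tree[idx]
        for nxt in adjacency[tok]:
            if len(tree) >= max_nodes:
                break
            tree.append((nxt, idx))
            queue.append(len(tree) - 1)
    return tree

update(5, [7, 9])   # token 5 was previously followed by candidates 7 and 9
update(7, [11])
tree = build_draft_tree(5)
print(tree)  # [(5, -1), (7, 0), (9, 0), (11, 1)]
```

In the paper the drafted tree is then verified in a single forward pass via tree attention; the sketch above only covers the storage and drafting side.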

📝 Abstract
Massive parameters of LLMs have made inference latency a fundamental bottleneck. Speculative decoding represents a lossless approach to accelerate inference through a guess-and-verify paradigm. Some methods rely on additional architectures to guess draft tokens, which need extra training before use. Alternatively, retrieval-based training-free techniques build libraries from pre-existing corpora or by n-gram generation. However, they face challenges like large storage requirements, time-consuming retrieval, and limited adaptability. Observing that candidate tokens generated during the decoding process are likely to reoccur in future sequences, we propose Token Recycling. It stores candidate tokens in an adjacency matrix and employs a breadth-first-search (BFS)-like algorithm to construct a draft tree, which is then validated through tree attention. New candidate tokens from the decoding process are then used to update the matrix. Token Recycling requires less than 2 MB of additional storage and achieves approximately 2× speedup across all sizes of LLMs. It significantly outperforms existing training-free methods by 30% and even a widely recognized training method by 25%.
Problem

Research questions and friction points this paper is trying to address.

Reducing inference latency in large language models
Eliminating need for extra training in speculative decoding
Minimizing storage and retrieval challenges in draft token generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Token Recycling stores candidates in adjacency matrix
BFS-like algorithm constructs draft tree for validation
Requires <2 MB storage, achieves ~2× speedup
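The validation step the innovation list refers to uses tree attention, which needs model internals to demonstrate; a simpler stand-in that conveys the same guess-and-verify logic is greedy path verification, sketched below. All names here are assumptions, and `model_next` is a hypothetical stand-in for a single greedy model prediction: the longest root-to-leaf draft prefix that the model itself would have produced is accepted losslessly.

```python
# Hypothetical greedy verification sketch (not the paper's tree-attention code).
# `model_next(prefix)` stands in for one greedy decoding step of the real LLM.
def verify(paths, model_next):
    """Accept the longest draft path whose tokens match the model's greedy choices."""
    best = [paths[0][0]]  # the root token is always accepted
    for path in paths:
        accepted = [path[0]]
        for tok in path[1:]:
            if model_next(tuple(accepted)) != tok:
                break  # model disagrees: reject the rest of this path
            accepted.append(tok)
        if len(accepted) > len(best):
            best = accepted
    return best

# Toy model: after [5] it would emit 7, after [5, 7] it would emit 11.
toy = {(5,): 7, (5, 7): 11}
print(verify([[5, 7, 11], [5, 9]], lambda prefix: toy.get(prefix)))  # [5, 7, 11]
```

Because only tokens the model would have generated anyway are accepted, the output distribution is unchanged; the speedup comes from verifying several positions per forward pass instead of one.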
Xianzhen Luo
Harbin Institute of Technology
Code Intelligence · Inference Acceleration
Yixuan Wang
Harbin Institute of Technology, Harbin, China
Qingfu Zhu
Harbin Institute of Technology
NLP · Code LLM
Zhiming Zhang
Harbin Institute of Technology, Harbin, China
Xuanyu Zhang
Du Xiaoman (Beijing) Science Technology Co., Ltd.
Qing Yang
Du Xiaoman (Beijing) Science Technology Co., Ltd.
Dongliang Xu
Du Xiaoman (Beijing) Science Technology Co., Ltd.
Wanxiang Che
Professor, Harbin Institute of Technology
Natural Language Processing