Efficiency Unleashed: Inference Acceleration for LLM-based Recommender Systems with Speculative Decoding

📅 2024-08-11
📈 Citations: 4
Influential: 0
📄 PDF
🤖 AI Summary
To address the high inference latency and computational overhead of offline knowledge generation in large language model (LLM)-based recommender systems, this paper proposes LASER, a retrieval-augmented speculative decoding framework. Methodologically, LASER adapts retrieval-based speculative decoding, previously unexplored for recommendation knowledge generation, to accelerate LLM inference. It introduces a user- and item-aware customized retrieval pool to mitigate the retrieval bottleneck caused by massive user and item catalogs, and designs a relaxed token-verification mechanism that exploits recommender systems' high tolerance for textual diversity to significantly raise draft-token acceptance rates. Experiments show that LASER achieves a 3-5x inference speedup on public benchmarks and saves about 67% of computational resources in a large-scale online advertising A/B test, without compromising downstream recommendation performance.
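The draft-then-verify loop with a relaxed acceptance threshold can be sketched as follows. Everything here (`VOCAB`, `target_probs`, `draft_tokens`, the threshold rule) is an illustrative assumption, not the paper's exact mechanism: the real system uses an LLM's next-token distribution and a retrieval-based drafter, and the paper's relaxed criterion may differ.

```python
# Toy sketch of draft-then-verify with a relaxed acceptance threshold.
# VOCAB, target_probs, and draft_tokens are stand-ins invented for this
# example so it runs without any ML stack.
VOCAB = ["the", "user", "likes", "sports", "news", "morning"]

def target_probs(prefix):
    # Stand-in for the target LLM's next-token distribution: a "top" token
    # gets 0.5, a runner-up gets 0.3, the rest share the remainder.
    i = len(prefix) % len(VOCAB)
    probs = {t: 0.05 for t in VOCAB}
    probs[VOCAB[i]] = 0.5
    probs[VOCAB[(i + 1) % len(VOCAB)]] = 0.3
    return probs

def draft_tokens(prefix, k):
    # Stand-in drafter that (deliberately) proposes the runner-up token at
    # each position, so strict and relaxed verification behave differently.
    return [VOCAB[(len(prefix) + i + 1) % len(VOCAB)] for i in range(k)]

def verify(prefix, drafts, threshold):
    """Accept a draft token if its probability under the target model is at
    least `threshold` times the top token's probability. threshold=1.0 is
    strict (greedy-exact) verification; lower values relax it, accepting
    'good enough' tokens and raising the acceptance rate."""
    accepted = []
    for tok in drafts:
        probs = target_probs(prefix + accepted)
        if probs[tok] >= threshold * max(probs.values()):
            accepted.append(tok)
        else:
            # On rejection, emit the target model's own top token and stop
            # verifying the rest of this draft.
            accepted.append(max(probs, key=probs.get))
            break
    return accepted

strict = verify([], draft_tokens([], 4), threshold=1.0)   # rejects the first draft
relaxed = verify([], draft_tokens([], 4), threshold=0.5)  # accepts all four drafts
```

With strict verification the first runner-up draft is rejected and only one token is emitted per step; relaxing the threshold lets all four drafts through, which is the acceptance-rate gain the summary describes, traded against exact token-level fidelity.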

📝 Abstract
The past few years have witnessed a growing interest in LLM-based recommender systems (RSs), although their industrial deployment remains in a preliminary stage. Most existing deployments leverage LLMs offline as feature enhancers, generating augmented knowledge for downstream tasks. However, in recommendation scenarios with numerous users and items, even offline knowledge generation with LLMs demands significant time and computational resources. This inefficiency arises from the autoregressive nature of LLMs. A promising solution is speculative decoding, a Draft-Then-Verify approach that increases the number of tokens generated per decoding step. In this work, we first identify recommendation knowledge generation as a highly fitting use case for retrieval-based speculative decoding. Then, we discern its two characteristics: (1) the vast number of items and users in RSs leads to retrieval inefficiency, and (2) RSs exhibit high diversity tolerance for LLM-generated text. Building on these insights, we introduce Lossless Acceleration via Speculative Decoding for LLM-based Recommender Systems (LASER), which features a Customized Retrieval Pool to enhance retrieval efficiency and Relaxed Verification to improve the acceptance rate of draft tokens. LASER achieves a 3-5x speedup on public datasets and saves about 67% of computational resources during the online A/B test on a large-scale advertising scenario with lossless downstream recommendation performance. Our code is available at https://github.com/YunjiaXi/LASER
Problem

Research questions and friction points this paper is trying to address.

Accelerate LLM-based recommender systems with speculative decoding
Overcome retrieval inefficiency in large-scale recommendation scenarios
Maintain recommendation performance while reducing computational resource usage
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adapts retrieval-based speculative decoding to recommendation knowledge generation
Customized Retrieval Pool for efficient, user- and item-aware drafting
Relaxed Verification for a higher draft-token acceptance rate
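A customized retrieval pool of the kind listed above might look like the sketch below: rather than searching text for all users and items, each request indexes only a small pool (e.g. the current user's history and candidate item descriptions) and matches the last few generated tokens against it to fetch draft continuations. The n-gram indexing scheme and pool contents are assumptions for illustration, not the paper's exact design.

```python
from collections import defaultdict

def build_pool(docs, n=2, span=4):
    """Index n-gram keys -> draft continuations over a small per-request
    text pool. Restricting the pool to the current user's and items' text
    (instead of the full catalog) is what keeps lookup cheap."""
    index = defaultdict(list)
    for doc in docs:
        toks = doc.split()
        for i in range(len(toks) - n):
            index[tuple(toks[i:i + n])].append(toks[i + n:i + n + span])
    return index

def retrieve_drafts(index, prefix_tokens, n=2):
    # Match the last n generated tokens against the pool; the hit (if any)
    # becomes the draft sequence handed to the verifier.
    matches = index.get(tuple(prefix_tokens[-n:]))
    return matches[0] if matches else []

# Hypothetical per-request pool: one line of the current user's history.
pool = build_pool(["the user likes sports news in the morning"])
drafts = retrieve_drafts(pool, ["likes", "the", "user"])
```

Here the suffix `("the", "user")` hits the pool and yields a four-token draft for the verifier, while a miss simply falls back to ordinary one-token-at-a-time decoding.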