Position-Aware Drafting for Inference Acceleration in LLM-Based Generative List-Wise Recommendation

📅 2026-04-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of existing speculative decoding methods in large language model (LLM)-based generative list-wise recommendation: they overlook the positional semantics of tokens within items and the growth of uncertainty at deeper speculation steps, both of which constrain inference acceleration. The authors propose PAD-Rec, which jointly models within-item slot positions and speculation step positions through item position embeddings and step position embeddings. Simple gates fuse these signals with the draft model's base features: a learnable coefficient for item slots and a context-driven gate for draft steps. This improves the draft model's structural awareness and its handling of depth-dependent uncertainty without altering the target distribution. Experiments on four real-world datasets show up to 3.1× wall-clock speedup and about 5% average wall-clock speedup gain over strong speculative decoding baselines, while largely preserving recommendation quality.
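
To make the two positional signals concrete, here is a minimal pure-Python sketch of how within-item slot indices could be derived from a semantic-ID token stream, and how item and step position embeddings could be fused into a draft hidden state. The function names, the `"|"` separator, the additive fusion form, the scalar `alpha`, and the sigmoid gate are all assumptions made for illustration; the page does not give PAD-Rec's exact formulation.

```python
import math


def item_slot_ids(tokens, sep="|"):
    """Return the within-item slot index of every token.

    Assumed convention: the counter restarts after each separator, and the
    separator itself takes the next slot of the item it closes.
    """
    slots, slot = [], 0
    for tok in tokens:
        slots.append(slot)
        slot = 0 if tok == sep else slot + 1
    return slots


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fuse(hidden, item_emb, step_emb, alpha, gate_w, gate_b):
    """Fuse positional embeddings into one hidden vector (hypothetical form).

    hidden   : base draft feature for one token
    item_emb : embedding looked up by the token's within-item slot
    step_emb : embedding looked up by the current draft step
    alpha    : learnable scalar coefficient for the item-slot signal
    gate_w/b : parameters of a context-driven scalar gate for the step signal
    """
    gate = sigmoid(sum(w * h for w, h in zip(gate_w, hidden)) + gate_b)
    return [h + alpha * ie + gate * se
            for h, ie, se in zip(hidden, item_emb, step_emb)]
```

For example, with three semantic-ID tokens per item, `item_slot_ids(["a", "b", "c", "|", "d", "e", "f", "|"])` yields `[0, 1, 2, 3, 0, 1, 2, 3]`, so tokens in the same slot of different items share a slot embedding. In this sketch the step gate depends on the hidden state, while the item-slot coefficient is a single shared scalar.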

📝 Abstract

Large language model (LLM)-based generative list-wise recommendation has advanced rapidly, but decoding remains sequential and thus latency-prone. To accelerate inference without changing the target distribution, speculative decoding (SD) uses a small draft model to propose several next tokens at once and a target LLM to verify and accept the longest prefix, skipping multiple steps per round. In generative recommendation, however, each item is represented by multiple semantic-ID tokens, often with separators, and current drafts typically treat these tokens uniformly. This overlooks two practical facts: (i) a token's semantics depend on its within-item slot, and (ii) uncertainty tends to increase with speculation depth. Without modeling these effects, SD's speedups can be limited. We introduce PAD-Rec, Position-Aware Drafting for generative Recommendation, a lightweight module that augments the draft model with two complementary signals. Item position embeddings explicitly encode the within-item slot of each token, strengthening structural awareness. Step position embeddings encode the draft step, allowing the model to adapt to depth-dependent uncertainty and improve proposal quality. To harmonize these signals with base features, we add simple gates: a learnable coefficient for item slots and a context-driven gate for draft steps. The module is trainable, easy to integrate with standard draft models, and adds negligible inference overhead. Extensive experiments on four real-world datasets show up to 3.1x wall-clock speedup and about 5% average wall-clock speedup gain over strong SD baselines, while largely preserving recommendation quality.
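
The draft-then-verify loop described in the abstract can be sketched as follows. This is a simplified greedy variant, not the paper's implementation: `target_next` and `draft_next` are hypothetical callables that map a token list to the next greedy token, and verification is written as a Python loop where a real system would score all draft positions in one batched target forward pass. Sampling-based speculative decoding uses a rejection-sampling acceptance rule instead of the exact-match test used here.

```python
def speculative_generate(target_next, draft_next, prompt, max_new_tokens, k=4):
    """Greedy speculative decoding with toy next-token callables.

    Returns the generated sequence and the number of draft/verify rounds.
    """
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new_tokens:
        rounds += 1

        # 1) The draft model proposes k tokens autoregressively.
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(out + proposal))

        # 2) The target verifies the proposal and keeps the longest
        #    prefix that matches its own greedy choices.
        accepted = []
        for tok in proposal:
            expected = target_next(out + accepted)
            if tok != expected:
                accepted.append(expected)  # target's correction token
                break
            accepted.append(tok)
        else:
            # Every draft token matched: the target adds one more token.
            accepted.append(target_next(out + accepted))

        out.extend(accepted)

    return out[:len(prompt) + max_new_tokens], rounds


# Toy models: the target counts upward mod 10; the draft errs after a 2.
target = lambda seq: (seq[-1] + 1) % 10
draft = lambda seq: 0 if seq[-1] == 2 else (seq[-1] + 1) % 10

tokens, rounds = speculative_generate(target, draft, [0], 8, k=4)
print(tokens, rounds)  # [0, 1, 2, 3, 4, 5, 6, 7, 8] 2
```

The output equals what the target alone would produce greedily, but it takes 2 rounds instead of 8 sequential target steps. Each round emits at least one target-approved token, so a poor draft costs speed rather than correctness, which is why better proposals at deeper draft steps translate directly into speedup.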

Problem

Research questions and friction points this paper is trying to address.

speculative decoding

generative recommendation

position awareness

inference acceleration

large language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

speculative decoding

position-aware drafting

generative recommendation