🤖 AI Summary
Generative recommender systems suffer from excessive computational and memory overhead due to the long input sequences induced by semantic identifiers (SIDs) for item representation. While prior work optimizes attention mechanisms or KV caching, this paper introduces the first **representation-aware semantic token pruning method**, which dynamically identifies and removes low-information tokens by jointly modeling their **representation magnitude** and **attention centrality**. The approach integrates semantic saliency analysis, cumulative attention weight estimation, and adaptive pruning, enabling sequence compression without compromising recommendation accuracy. Extensive experiments on three Amazon datasets demonstrate an average 26.7% reduction in training time while maintaining or improving key metrics such as Recall@10. The implementation is publicly available.
📝 Abstract
Generative recommendation systems typically leverage Semantic Identifiers (SIDs), which represent each item as a sequence of tokens that encode semantic information. However, representing each item ID with multiple SID tokens significantly increases the input sequence length, which is a major determinant of computational complexity and memory consumption. While existing efforts primarily focus on optimizing attention computation and the KV cache, we propose RASTP (Representation-Aware Semantic Token Pruning), which directly prunes less informative tokens from the input sequence. Specifically, RASTP evaluates token importance by combining semantic saliency, measured via representation magnitude, and attention centrality, derived from cumulative attention weights. By dynamically pruning low-information or irrelevant semantic tokens, RASTP reduces training time by 26.7% on three real-world Amazon datasets, while maintaining or slightly improving recommendation performance. The code has been open-sourced at https://github.com/Yuzt-zju/RASTP.
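To make the scoring idea concrete, here is a minimal sketch of how the two signals described in the abstract could be combined: semantic saliency as the L2 norm of each token's representation, and attention centrality as the cumulative attention weight each token receives. The function names, the mixing weight `alpha`, the normalization scheme, and the `keep_ratio` pruning budget are all illustrative assumptions, not details from the paper.

```python
import numpy as np

def rastp_importance(hidden, attn, alpha=0.5):
    """Score tokens by mixing semantic saliency and attention centrality.

    hidden: (seq_len, d) array of token representations
    attn:   (seq_len, seq_len) attention weights (rows are queries)
    alpha:  hypothetical mixing weight -- not specified in the abstract
    """
    # Semantic saliency: representation magnitude (L2 norm per token).
    saliency = np.linalg.norm(hidden, axis=-1)
    # Attention centrality: cumulative attention each token receives,
    # i.e. the column sums of the attention matrix.
    centrality = attn.sum(axis=0)
    # Normalize both signals to [0, 1] before mixing (illustrative choice).
    saliency = saliency / (saliency.max() + 1e-9)
    centrality = centrality / (centrality.max() + 1e-9)
    return alpha * saliency + (1 - alpha) * centrality

def prune_tokens(tokens, scores, keep_ratio=0.75):
    """Keep the top keep_ratio fraction of tokens, preserving order."""
    k = max(1, int(len(tokens) * keep_ratio))
    keep = np.sort(np.argsort(scores)[-k:])  # top-k indices, in order
    return [tokens[i] for i in keep]
```

In this sketch, tokens with both small representation norms and little incoming attention score lowest and are dropped first, which is the intuition behind shortening the SID sequence without discarding semantically salient items.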