On-Device Large Language Models for Sequential Recommendation

📅 2026-01-14
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of deploying large language models (LLMs) for sequential recommendation on resource-constrained edge devices, where high memory and computational demands severely limit practical applicability. To this end, the authors propose OD-LLM, a framework that integrates low-rank structured compression based on singular value decomposition (SVD), tokenization normalization, and a layer-wise progressive alignment algorithm to achieve task-adaptive model compression. The approach reduces model size by 50% while preserving recommendation performance comparable to that of the original uncompressed model. This compression efficiency substantially improves the feasibility and scalability of deploying LLMs in real-time, edge-based recommendation scenarios, bridging the gap between powerful language models and practical on-device applications.

📝 Abstract
On-device recommendation is critical for many real-world applications, especially in scenarios that impose requirements on execution latency, user privacy, and robust functionality when internet connectivity is unstable or unavailable. While large language models (LLMs) now offer exceptional capabilities for modeling user behavior in sequential recommendation tasks, their substantial memory footprint and computational overhead make deployment on resource-constrained devices a high-risk proposition. In this paper, we propose OD-LLM, the first task-adaptive compression framework explicitly designed for efficient and accurate on-device deployment of LLMs for sequential recommendation. OD-LLM integrates two complementary compression strategies: a low-rank structural compression algorithm that uses Singular Value Decomposition (SVD) to significantly reduce parameter redundancy in the model, and a novel tokenization normalization technique that complements the low-rank decomposition. Additionally, to minimize performance degradation at higher compression ratios, a novel progressive alignment algorithm iteratively refines the target model's parameters layer by layer. Empirical evaluations on sequential recommendation benchmarks show that OD-LLM exhibits no loss in effectiveness compared to the original recommendation model even when the deployed model size is halved. These results demonstrate the efficacy and scalability of OD-LLM, making it a practical alternative for real-time, on-device scenarios that would otherwise rely on expensive, remotely executed LLMs.
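The abstract's core compression idea, truncated-SVD factorization of weight matrices, can be illustrated with a short sketch. This is not the paper's OD-LLM implementation (which adds tokenization normalization and progressive alignment); it is a generic example of the underlying low-rank technique, with all names (`low_rank_compress`, the 512×512 layer, rank 64) chosen for illustration:

```python
import numpy as np

def low_rank_compress(W, rank):
    """Truncated-SVD factorization of a weight matrix.

    Replaces W (d_out x d_in) with two factors A (d_out x rank) and
    B (rank x d_in), cutting the parameter count from d_out * d_in
    to rank * (d_out + d_in).
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]  # absorb singular values into the left factor
    B = Vt[:rank, :]
    return A, B

# Example: compress a hypothetical 512x512 linear layer to rank 64.
rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512))
A, B = low_rank_compress(W, rank=64)

original_params = W.size             # 512 * 512 = 262144
compressed_params = A.size + B.size  # 64 * (512 + 512) = 65536, ~4x fewer
reconstruction = A @ B               # best rank-64 approximation of W
```

At inference time the layer's matrix-vector product `W @ x` becomes `A @ (B @ x)`, which is also cheaper whenever `rank * (d_out + d_in) < d_out * d_in`; how aggressively the rank can be cut without hurting recommendation quality is exactly what the paper's alignment procedure is meant to control.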
Problem

Research questions and friction points this paper is trying to address.

on-device recommendation
large language models
sequential recommendation
model compression
resource-constrained devices
Innovation

Methods, ideas, or system contributions that make the work stand out.

on-device LLM
low-rank compression
tokenization normalization
progressive alignment
sequential recommendation