🤖 AI Summary
To address the challenge of efficiently migrating cloud-based re-ranking models to resource-constrained mobile devices while preserving both personalization and recommendation accuracy under device heterogeneity, this paper proposes a cloud-edge collaborative mixed-precision sequential recommendation framework. Our method introduces: (1) a hypernetwork-based user-sensitive parameter identification mechanism enabling multi-granularity sensitivity analysis and dynamic, backpropagation-free adaptation; and (2) channel-wise mixed-precision quantization with a 2-bit encoding scheme, coupled with fine-tuning-free on-device inference optimization—reducing computational and communication overhead significantly without compromising accuracy. Extensive experiments on three real-world datasets demonstrate that our approach substantially improves recommendation accuracy over conventional quantization baselines, while reducing inference latency and bandwidth consumption by large margins.
📝 Abstract
With the advancement of mobile device capabilities, deploying re-ranking models directly on devices has become feasible, enabling real-time contextual recommendations. When migrating models from cloud to devices, resource heterogeneity inevitably necessitates model compression. Recent quantization methods show promise for efficient deployment, yet they overlook device-specific user interests, resulting in compromised recommendation accuracy. While on-device fine-tuning captures personalized user preferences, it imposes an additional computational burden through local retraining. To address these challenges, we propose a framework for **C**ustomizing **H**ybrid-precision **O**n-device models for sequential **R**ecommendation with **D**evice-cloud collaboration (**CHORD**), leveraging channel-wise mixed-precision quantization to simultaneously achieve personalization and resource-adaptive deployment. CHORD distributes randomly initialized models across heterogeneous devices and identifies user-specific critical parameters through auxiliary hypernetwork modules on the cloud. Our parameter sensitivity analysis operates across multiple granularities (layer, filter, and element levels), enabling a precise mapping from user profiles to quantization strategies. Through on-device mixed-precision quantization, CHORD delivers dynamic model adaptation and accelerated inference without backpropagation, eliminating costly retraining cycles. We minimize communication overhead by encoding the quantization strategy with only 2 bits per channel instead of transmitting 32-bit weights. Experiments on three real-world datasets with two popular backbones (SASRec and Caser) demonstrate the accuracy, efficiency, and adaptivity of CHORD.
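To make the 2-bit strategy encoding concrete, here is a minimal sketch of channel-wise mixed-precision quantization in NumPy. The bit-width palette (`BITWIDTHS`), the symmetric uniform quantizer, and the function names are illustrative assumptions, not the paper's actual implementation: the key idea shown is that the cloud only needs to send one 2-bit code per channel (selecting among four precision levels) rather than 32-bit weights.

```python
import numpy as np

# Assumed palette: each 2-bit code (0..3) selects one of four bit-widths.
# The paper's actual precision levels are not specified here.
BITWIDTHS = {0: 2, 1: 4, 2: 8, 3: 16}

def quantize_channel(w, bits):
    """Symmetric uniform quantization of one channel to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = np.max(np.abs(w))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q.astype(np.int32), scale

def apply_strategy(weight, codes):
    """Quantize each output channel with the bit-width chosen by its 2-bit code."""
    out = np.empty_like(weight)
    for c, code in enumerate(codes):
        q, scale = quantize_channel(weight[c], BITWIDTHS[int(code)])
        out[c] = q * scale  # dequantized weights used for on-device inference
    return out

# Strategy transfer cost: 2 bits/channel vs. 32 bits/weight for a full update.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8)).astype(np.float32)
codes = np.array([3, 0, 2, 1], dtype=np.uint8)  # e.g., from cloud-side sensitivity analysis
W_q = apply_strategy(W, codes)
```

Channels flagged as user-sensitive would receive high-precision codes (small reconstruction error), while insensitive channels are aggressively quantized, which is what makes the per-channel strategy act as a form of personalization.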