🤖 AI Summary
To address the high memory footprint and slow inference of large language models (LLMs) on resource-constrained edge devices, this paper proposes the lightweight Side-Plugin Adaptation (SPA) architecture for efficient cloud-edge collaborative seq2seq generation. Our method introduces a novel cloud-edge parameter decoupling paradigm: general knowledge is retained in the cloud-based LLM, while personalized parameters are deployed on-device as ultra-lightweight side plugins—supporting hierarchical deployment and incremental fine-tuning. Under stringent memory and computational constraints, SPA ensures fast, stable on-device inference. Experimental results demonstrate that SPA reduces inference latency by 42% and memory consumption by 67% compared to state-of-the-art on-device LLM approaches, significantly enhancing both personalization efficiency and practical deployability.
📝 Abstract
Large language models(LLMs) have shown its outperforming ability on various tasks and question answering. However, LLMs require substantial memory storage on low-resource devices. More critically, the computational speed on these devices is also severely limited. In this paper, we propose SPA(Side Plugin Adaption), a lightweight architecture for fast on-devices inference on the constraints of strict on-devices computation and memory constraints. Compared with other on-devices seq2seq generation, SPA could make a fast and stable inference on low-resource constraints, allowing it to obtain cost effiency. Our method establish an interaction between a pretrained LLMs on-cloud and additive parameters on-devices, which could provide the knowledge on both pretrained LLMs and featured personal feature. Further more, SPA provides a framework to keep feature-base parameters on low computational devices while leave the parameters containing general information on the high computational devices.