SPA: Towards A Computational Friendly Cloud-Base and On-Devices Collaboration Seq2seq Personalized Generation

📅 2024-03-11

🏛️ Pacific Rim International Conference on Artificial Intelligence

📈 Citations: 2

✨ Influential: 0

career value

204K/year

🤖 AI Summary

To address the high memory footprint and slow inference of large language models (LLMs) on resource-constrained edge devices, this paper proposes the lightweight Side-Plugin Adaptation (SPA) architecture for efficient cloud-edge collaborative seq2seq generation. Our method introduces a novel cloud-edge parameter decoupling paradigm: general knowledge is retained in the cloud-based LLM, while personalized parameters are deployed on-device as ultra-lightweight side plugins—supporting hierarchical deployment and incremental fine-tuning. Under stringent memory and computational constraints, SPA ensures fast, stable on-device inference. Experimental results demonstrate that SPA reduces inference latency by 42% and memory consumption by 67% compared to state-of-the-art on-device LLM approaches, significantly enhancing both personalization efficiency and practical deployability.

Technology Category

Application Category

📝 Abstract

Large language models(LLMs) have shown its outperforming ability on various tasks and question answering. However, LLMs require substantial memory storage on low-resource devices. More critically, the computational speed on these devices is also severely limited. In this paper, we propose SPA(Side Plugin Adaption), a lightweight architecture for fast on-devices inference on the constraints of strict on-devices computation and memory constraints. Compared with other on-devices seq2seq generation, SPA could make a fast and stable inference on low-resource constraints, allowing it to obtain cost effiency. Our method establish an interaction between a pretrained LLMs on-cloud and additive parameters on-devices, which could provide the knowledge on both pretrained LLMs and featured personal feature. Further more, SPA provides a framework to keep feature-base parameters on low computational devices while leave the parameters containing general information on the high computational devices.

Problem

Research questions and friction points this paper is trying to address.

Reducing memory storage for LLMs on low-resource devices

Improving computational speed on constrained devices

Enabling cloud-device collaboration for personalized generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight architecture for fast on-device inference

Cloud-device collaboration with additive parameters

Feature-base parameters on low-resource devices

🔎 Similar Papers

No similar papers found.