LoRA-Gen: Specializing Large Language Model via Online LoRA Generation

📅 2025-06-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address challenges in edge-device deployment—including difficulty in domain adaptation, high fine-tuning costs, and excessive inference latency—this paper proposes a cloud-edge collaborative framework for online LoRA generation and fusion. The method enables a cloud-based large language model to dynamically generate lightweight LoRA adapters from task descriptions, which are then fused into an edge-deployed small model through a training-free, low-overhead reparameterization step. This enables plug-and-play specialized inference without any on-device fine-tuning computation or storage. Crucially, the approach decouples adaptation logic from edge hardware constraints, balancing accuracy and efficiency. Experiments demonstrate a 2.1× inference speedup on TinyLLaMA-1.1B with accuracy matching full LoRA fine-tuning, and a 10.1× parameter compression ratio on Gemma-2B for agent tasks. Overall, the framework significantly enhances the practicality and scalability of edge AI systems.

📝 Abstract
Recent advances have highlighted the benefits of scaling language models to enhance performance across a wide range of NLP tasks. However, these approaches still face limitations in effectiveness and efficiency when applied to domain-specific tasks, particularly for small edge-side models. We propose the LoRA-Gen framework, which utilizes a large cloud-side model to generate LoRA parameters for edge-side models based on task descriptions. By employing the reparameterization technique, we merge the LoRA parameters into the edge-side model to achieve flexible specialization. Our method facilitates knowledge transfer between models while significantly improving the inference efficiency of the specialized model by reducing the input context length. Without specialized training, LoRA-Gen outperforms conventional LoRA fine-tuning, achieving competitive accuracy and a 2.1x speedup with TinyLLaMA-1.1B on reasoning tasks. Moreover, our method delivers a compression ratio of 10.1x with Gemma-2B on intelligent agent tasks.
Problem

Research questions and friction points this paper is trying to address.

Enhancing domain-specific task performance for small edge-side models
Improving inference efficiency by reducing input context length
Achieving knowledge transfer without specialized training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates LoRA parameters via cloud-side model
Merges LoRA using reparameterization for specialization
Reduces input context length for efficiency
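The reparameterization step above can be sketched in a few lines: a LoRA update is a low-rank product that folds directly into the frozen base weight, so the specialized model runs with zero extra inference cost. This is a minimal illustration using standard LoRA conventions (the `alpha`/`rank` scaling and function name are assumptions, not the paper's exact implementation):

```python
import numpy as np

def merge_lora(W, A, B, alpha=16, rank=8):
    """Fold a LoRA update into a frozen weight via reparameterization.

    W: (d_out, d_in) frozen base weight
    A: (rank, d_in) LoRA down-projection
    B: (d_out, rank) LoRA up-projection
    Returns W' = W + (alpha / rank) * B @ A  -- standard LoRA scaling (assumed).
    """
    return W + (alpha / rank) * (B @ A)

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 64, 8
W = rng.standard_normal((d_out, d_in))
A = rng.standard_normal((r, d_in))   # e.g. generated cloud-side from a task description
B = rng.standard_normal((d_out, r))

W_merged = merge_lora(W, A, B, alpha=16, rank=r)

# The merged matrix reproduces base + scaled low-rank path in one matmul,
# so specialization adds no latency at inference time.
x = rng.standard_normal(d_in)
assert np.allclose(W_merged @ x, W @ x + (16 / r) * (B @ (A @ x)))
```

Because the merge is a plain matrix addition, the edge device only needs to receive the small `A` and `B` factors and apply them once, which is what makes the plug-and-play, training-free specialization practical.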