🤖 AI Summary
Edge-device GPUs face severe memory constraints, hindering fine-tuning and multimodal extension of large language models (LLMs).
Method: We systematically survey memory-efficient fine-tuning techniques (e.g., LoRA, QLoRA, Adapters) and model compression methods (e.g., quantization, pruning, knowledge distillation, sparse training), and propose, for the first time, a synergistic fine-tuning-and-compression paradigm tailored for edge deployment. We design a unified evaluation framework that quantifies trade-offs across three dimensions: energy efficiency, hardware compatibility, and multimodal generalization capability.
Contribution/Results: We establish the first taxonomy of LLM lightweighting techniques specifically for edge deployment, characterizing each method's performance in GPU memory footprint, inference latency, accuracy retention, and cross-platform adaptability. Our work provides both theoretical foundations and practical guidelines for sustainable on-device AI deployment.
📝 Abstract
Since the release of GPT-2 (1.5B parameters) in 2019, large language models (LLMs) have transitioned from specialized models to versatile foundation models. LLMs exhibit impressive zero-shot ability; however, they require fine-tuning on local datasets and significant resources for deployment. Traditional fine-tuning techniques with first-order optimizers demand GPU memory that exceeds the capacity of mainstream hardware, motivating the investigation of memory-efficient methods. Model compression techniques can reduce energy consumption, operational costs, and environmental impact, thereby supporting sustainable artificial intelligence advancement. Additionally, large-scale foundation models have expanded to generate images, audio, videos, and multimodal content, further emphasizing the need for efficient deployment. We are therefore motivated to present a comprehensive overview of the prevalent memory-efficient fine-tuning methods over the network edge. We also review the state-of-the-art literature on model compression to provide a vision for deploying LLMs over the network edge.
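To illustrate why memory-efficient fine-tuning methods such as LoRA reduce the memory pressure described above, the sketch below shows the core low-rank idea in NumPy: the pretrained weight is frozen and only two small factor matrices are trained. All shapes and hyperparameters here are illustrative assumptions, not values from this survey.

```python
import numpy as np

# Hypothetical layer shapes and LoRA hyperparameters (assumptions for illustration).
d_in, d_out, r, alpha = 512, 512, 8, 16

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))     # frozen pretrained weight (no gradients)
A = rng.standard_normal((r, d_in)) * 0.01  # trainable low-rank factor
B = np.zeros((d_out, r))                   # zero-initialized: update starts at zero

# Effective weight used in the forward pass; only A and B are updated during training.
W_eff = W + (alpha / r) * (B @ A)

full_params = W.size            # parameters touched by full fine-tuning
lora_params = A.size + B.size   # parameters touched by LoRA
print(f"trainable: {lora_params} vs full fine-tuning: {full_params} "
      f"({full_params / lora_params:.0f}x fewer)")
```

Because only `A` and `B` carry optimizer state (gradients, momentum), the first-order optimizer's memory footprint shrinks by roughly the same factor as the trainable parameter count, which is what makes on-device fine-tuning feasible for the edge GPUs discussed in this survey.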