SHARP: Accelerating Language Model Inference by SHaring Adjacent layers with Recovery Parameters

📅 2025-02-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high memory and computational overhead that hinder large language model (LLM) deployment on mobile devices, this paper proposes a lightweight parameter-sharing and recovery framework for Transformers. It shares MLP weights between adjacent layers to substantially compress the model and adds low-rank compensation matrices to restore performance. A two-stage recovery mechanism is introduced: Single Layer Warmup (SLW) initialization followed by minimal supervised fine-tuning (≤50K samples), eliminating the need for pretraining-level resources. Empirical analysis shows that sharing weights in deeper layers better preserves performance, and an L2 output-alignment loss improves training stability. Evaluated on Llama2-7B, the method achieves a 42.8% reduction in model storage, a 42.2% reduction in total inference time, and a 38–65% reduction in stored MLP parameters while fully recovering perplexity. The approach balances efficiency, practicality, and cross-model transferability, offering a practical route to edge-device LLM deployment.
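The core architectural idea, two adjacent layers reusing one set of MLP weights while each keeps small per-layer low-rank recovery matrices, can be sketched as follows. This is a minimal illustration under assumed names (`SharedMLPWithRecovery`, `rec_a`, `rec_b`); it is not the authors' implementation, and the LoRA-style additive correction is an assumption about how the compensation matrices are applied.

```python
import torch
import torch.nn as nn

class SharedMLPWithRecovery(nn.Module):
    """Sketch of SHARP's sharing idea: two adjacent Transformer layers
    reuse one MLP's weights, and each layer adds its own low-rank
    recovery term. Hypothetical names, not the authors' code."""

    def __init__(self, d_model: int, d_ff: int, rank: int = 4):
        super().__init__()
        # One set of MLP weights, stored once and shared by both layers.
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)
        # Per-layer low-rank recovery parameters (additive x @ A @ B).
        self.rec_a = nn.ParameterList(
            [nn.Parameter(torch.randn(d_model, rank) * 0.01) for _ in range(2)])
        self.rec_b = nn.ParameterList(
            [nn.Parameter(torch.zeros(rank, d_model)) for _ in range(2)])

    def forward(self, x: torch.Tensor, layer_idx: int) -> torch.Tensor:
        shared = self.down(torch.relu(self.up(x)))            # shared MLP path
        recovery = x @ self.rec_a[layer_idx] @ self.rec_b[layer_idx]
        return shared + recovery                              # per-layer correction
```

Because `rec_b` starts at zero, both layers initially compute the identical shared MLP; the recovery matrices only diverge once trained, which is what the SLW/SFT stages do. The stored parameter count is roughly half that of two independent MLPs plus a small low-rank overhead.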

📝 Abstract
While large language models (LLMs) have advanced natural language processing tasks, their growing computational and memory demands make deployment on resource-constrained devices like mobile phones increasingly challenging. In this paper, we propose SHARP (SHaring Adjacent Layers with Recovery Parameters), a novel approach to accelerate LLM inference by sharing parameters across adjacent layers, thus reducing memory load overhead, while introducing low-rank recovery parameters to maintain performance. Inspired by observations that consecutive layers have similar outputs, SHARP employs a two-stage recovery process: Single Layer Warmup (SLW), and Supervised Fine-Tuning (SFT). The SLW stage aligns the outputs of the shared layers using L_2 loss, providing a good initialization for the following SFT stage to further restore the model performance. Extensive experiments demonstrate that SHARP can recover the model's perplexity on various in-distribution tasks using no more than 50k fine-tuning samples while reducing the number of stored MLP parameters by 38% to 65%. We also conduct several ablation studies of SHARP and show that replacing layers towards the later parts of the model yields better performance retention, and that different recovery parameterizations perform similarly when parameter counts are matched. Furthermore, SHARP saves 42.8% in model storage and reduces the total inference time by 42.2% compared to the original Llama2-7B model on mobile devices. Our results highlight SHARP as an efficient solution for reducing inference costs in deploying LLMs without the need for pretraining-scale resources.
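The SLW stage described above can be sketched as a small optimization loop: freeze everything except the low-rank recovery parameters and minimize the L_2 distance between the shared layer's output and the original (replaced) layer's output. All function and variable names here are hypothetical; this is a sketch of the stated objective, not the authors' training code.

```python
import torch
import torch.nn as nn

def single_layer_warmup(shared_forward, teacher_forward, recovery_params,
                        batches, lr=1e-2):
    """Sketch of the SLW stage: optimize only the low-rank recovery
    parameters so the shared layer's output matches the replaced
    (teacher) layer's output under an L2 objective.
    Hypothetical helper names, not the authors' implementation."""
    opt = torch.optim.Adam(recovery_params, lr=lr)
    for x in batches:
        with torch.no_grad():
            target = teacher_forward(x)           # original layer's output
        loss = torch.mean((shared_forward(x) - target) ** 2)  # L2 alignment
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()
```

Because only the small recovery matrices receive gradients, this warmup is cheap relative to fine-tuning the full model, and it gives the subsequent SFT stage a well-aligned starting point.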
Problem

Research questions and friction points this paper is trying to address.

Reduce memory load in LLM inference
Accelerate inference on resource-constrained devices
Maintain performance with fewer parameters
Innovation

Methods, ideas, or system contributions that make the work stand out.

Shared adjacent layer parameters
Low-rank recovery parameters
Two-stage recovery process