🤖 AI Summary
To address the high inference overhead in multi-task prompt tuning caused by loading the entire model, this paper proposes Skeleton—a framework that identifies task-critical neurons via gradient-based attribution, enabling neuron-level sparse activation. Skeleton tightly couples lightweight prompt embeddings with dynamic subnetwork selection, activating only task-relevant subnetworks during inference. It is the first method to deeply integrate neuron-level sparsity with prompt tuning, supporting mainstream Transformer architectures including LLaMA and BERT. Evaluated on multiple benchmarks, Skeleton achieves up to 1.73× faster inference, with substantial reductions in memory consumption and latency, while matching the performance of full-parameter prompt tuning—without requiring fine-tuning of the backbone model.
📝 Abstract
Prompt tuning methods, as parameter-efficient fine-tuning (PEFT) methods, have shown performance comparable to general training methods on various natural language understanding tasks. However, existing prompt tuning methods still utilize the entire model architecture even when solving a specific task, which prevents inference from being accelerated at deployment time. In this paper, we propose a novel prompt tuning framework called Skeleton that uses a language model efficiently, in terms of both memory and time complexity, for solving various tasks by retaining only task-relevant neurons identified with an explainability method. With our framework, a single language model can solve various tasks efficiently by activating only task-relevant neurons and prepending adequate task-specific prompt tokens. Experiments reveal that our method significantly enhances inference efficiency (up to a 1.73× speedup) on various widely used benchmarks, while showing performance comparable to standard prompt tuning. Moreover, our method is applicable across various Transformer-based architectures, confirming its practicality and scalability.
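The core mechanism described above, scoring hidden neurons with a gradient-based attribution and then activating only the top-scoring subset at inference, can be sketched as follows. This is a minimal illustration, not the paper's implementation: `ToyFFN`, `attribute_neurons`, the activation-times-gradient score, and the `keep_ratio` parameter are all assumptions standing in for one Transformer feed-forward block and Skeleton's actual attribution procedure.

```python
import torch
import torch.nn as nn

class ToyFFN(nn.Module):
    """Hypothetical stand-in for one Transformer feed-forward block."""
    def __init__(self, d_model=16, d_ff=64):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x, neuron_mask=None):
        h = torch.relu(self.up(x))
        if neuron_mask is not None:
            # Neuron-level sparse activation: zero out task-irrelevant neurons.
            h = h * neuron_mask
        return self.down(h)

def attribute_neurons(layer, x, y, loss_fn, keep_ratio=0.25):
    """Score each hidden neuron by |activation * gradient| (a common
    gradient-based attribution) on a task batch and keep the top fraction."""
    h = torch.relu(layer.up(x))
    h.retain_grad()                 # keep the gradient of this intermediate
    loss = loss_fn(layer.down(h), y)
    loss.backward()
    scores = (h * h.grad).abs().sum(dim=0)   # aggregate over the batch
    k = int(keep_ratio * scores.numel())
    mask = torch.zeros_like(scores)
    mask[scores.topk(k).indices] = 1.0
    return mask

torch.manual_seed(0)
layer = ToyFFN()
x, y = torch.randn(8, 16), torch.randn(8, 16)   # toy task batch
mask = attribute_neurons(layer, x, y, nn.MSELoss())
sparse_out = layer(x, neuron_mask=mask)          # only 25% of neurons active
```

In the actual framework, a mask like this would be precomputed per task and paired with that task's prompt tokens, so inference touches only the selected subnetwork; in a real implementation the pruned neurons' weight rows would be physically dropped rather than multiplied by zero, which is where the speedup comes from.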