TELL-TALE: Task Efficient LLMs with Task Aware Layer Elimination

πŸ“… 2025-10-26
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Large language models (LLMs) suffer from low inference efficiency and redundant transformer layers that hinder task-specific representation learning. Method: We propose TALEβ€”a training-free, task-aware transformer layer pruning algorithm for inference-time adaptation. TALE jointly quantifies layer importance via mutual information and gradient-based metrics, dynamically identifying and removing bottleneck layers that impede task performance, while enabling adjustable accuracy-efficiency trade-offs. Results: Evaluated across five mainstream LLMs (LLaMA, Qwen, Mistral, etc.) and nine NLP tasks under zero-shot and few-shot settings, TALE reduces model size and accelerates inference, while improving average accuracy. Fine-tuning convergence speed also increases significantly. Crucially, TALE is the first to empirically reveal the phenomenon of β€œlayer-wise suppression of task-relevant representations,” establishing a novel paradigm for efficient LLM adaptation without architectural or training modifications.

πŸ“ Abstract
In this paper we introduce TALE, Task-Aware Layer Elimination, an inference-time algorithm that prunes entire transformer layers in an LLM by directly optimizing task-specific validation performance. We evaluate TALE on 9 tasks and 5 models, including LLaMA 3.1 8B, Qwen 2.5 7B, Qwen 2.5 0.5B, Mistral 7B, and Lucie 7B, under both zero-shot and few-shot settings. Unlike prior approaches, TALE requires no retraining and consistently improves accuracy while reducing computational cost across all benchmarks. Furthermore, applying TALE during fine-tuning leads to additional performance gains. Finally, TALE provides flexible user control over trade-offs between accuracy and efficiency. Mutual information analysis shows that certain layers act as bottlenecks, degrading task-relevant representations. TALE's selective layer removal remedies this problem, producing smaller, faster, and more accurate models that are also faster to fine-tune while offering new insights into transformer interpretability.
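The abstract describes selecting layers to eliminate by directly optimizing task-specific validation performance. A minimal sketch of what such a search could look like is below; the greedy loop, the toy "layers," and the scoring function are illustrative stand-ins, not the paper's actual algorithm or metrics.

```python
from typing import Callable, List, Set

def evaluate(layers: List[Callable], skip: Set[int], x: float) -> float:
    # Run the toy "model" with the given layers skipped. In the real setting
    # this would be task-specific validation accuracy of the pruned LLM;
    # here it is a hypothetical score (closeness of output to a target).
    out = x
    for i, layer in enumerate(layers):
        if i not in skip:
            out = layer(out)
    return -abs(out - 10.0)

def greedy_layer_elimination(layers, score, max_removed):
    """Greedily remove the layer whose removal most improves the validation
    score; stop when no removal helps or the removal budget is exhausted.
    A sketch of task-aware elimination, not the paper's exact procedure."""
    removed: Set[int] = set()
    best = score(removed)
    while len(removed) < max_removed:
        candidates = [i for i in range(len(layers)) if i not in removed]
        gains = {i: score(removed | {i}) for i in candidates}
        i_best = max(gains, key=gains.get)
        if gains[i_best] <= best:
            break  # no single layer removal improves validation performance
        removed.add(i_best)
        best = gains[i_best]
    return removed, best

# Toy stack: layer 2 acts as a "bottleneck" that degrades the task output.
layers = [lambda v: v + 3, lambda v: v + 3, lambda v: v * 0.1, lambda v: v + 4]
score = lambda skip: evaluate(layers, skip, x=0.0)
removed, best = greedy_layer_elimination(layers, score, max_removed=2)
print(removed, best)  # the bottleneck layer (index 2) is eliminated
```

The `max_removed` budget mirrors the user-controllable accuracy-efficiency trade-off the paper advertises: a larger budget trades more potential accuracy drift for a smaller, faster model.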
Problem

Research questions and friction points this paper is trying to address.

How to optimize task-specific performance by pruning entire transformer layers
How to reduce computational cost while simultaneously improving model accuracy
How to give users flexible control over accuracy-efficiency trade-offs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prunes entire transformer layers by directly optimizing task-specific validation performance
Requires no retraining while reducing computational cost
Provides flexible control over accuracy-efficiency trade-offs