🤖 AI Summary
This work addresses the low inference efficiency of Transformers by proposing an adaptive rank allocation framework together with the RaNA adapter. RaNA integrates low-rank decomposition and adaptive masking directly in rank space within the linear modules of both MLP and attention layers—the first approach to unify the two. It removes the reliance on activation sparsity and on dedicated masker networks, enabling acceleration across full Transformer architectures, attention included. The method combines FLOPs-aware optimization, parameter-efficient fine-tuning (PEFT), and adaptive rank control for linear layers. At a ~44% FLOPs reduction, experiments show improvements of up to 7 perplexity points and 8 accuracy percentage points over existing neuron-adaptive methods.
📝 Abstract
Large Language Models (LLMs) are computationally intensive, particularly during inference. Neuron-adaptive techniques, which selectively activate neurons in Multi-Layer Perceptron (MLP) layers, offer some speedups but suffer from limitations in modern Transformers. These include reliance on sparse activations, incompatibility with attention layers, and the use of costly neuron masking techniques. To address these issues, we propose the Adaptive Rank Allocation framework and introduce the Rank and Neuron Allocator (RaNA) adapter. RaNA adapters leverage rank adapters, which operate on linear layers by applying both low-rank matrix decompositions and adaptive masking to efficiently allocate compute without depending on activation sparsity. This enables RaNA to be generally applied to MLPs and linear components of attention modules, while eliminating the need for expensive maskers found in neuron-adaptive methods. Notably, when compared to neuron adapters, RaNA improves perplexity by up to 7 points and increases accuracy by up to 8 percentage points when reducing FLOPs by $\sim$44% in state-of-the-art Transformer architectures. These results position RaNA as a robust solution for improving inference efficiency in modern Transformer architectures.
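The core idea of a rank adapter—replace a linear layer's weight with a truncated low-rank factorization, then adaptively mask rank components per input—can be sketched as follows. This is an illustrative toy in NumPy, not the paper's implementation: the top-k-by-magnitude masking rule and the function names (`low_rank_factors`, `rana_style_forward`) are assumptions for exposition; RaNA's actual allocation criterion and training procedure may differ.

```python
import numpy as np

def low_rank_factors(W, r):
    """Truncated SVD: W (d_out, d_in) ≈ A @ B with A (d_out, r), B (r, d_in)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :r] * s[:r]   # fold singular values into the left factor
    B = Vt[:r, :]
    return A, B

def rana_style_forward(x, A, B, k):
    """Low-rank forward pass that keeps only k of the r rank components.

    Hypothetical masking rule: score each rank component by the magnitude
    of its projection z = B @ x and keep the k largest, so compute is
    allocated to the components that matter most for this input.
    """
    z = B @ x                          # project input into rank space, shape (r,)
    keep = np.argsort(-np.abs(z))[:k]  # indices of the top-k components
    mask = np.zeros_like(z)
    mask[keep] = 1.0
    return A @ (z * mask)              # masked low-rank output, shape (d_out,)
```

With `k` equal to the full rank `r`, this reduces to the plain low-rank approximation `A @ B @ x`; shrinking `k` trades accuracy for fewer effective FLOPs, which is the kind of compute/quality knob the framework exposes.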