Adaptive Rank Allocation: Speeding Up Modern Transformers with RaNA Adapters

📅 2025-03-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the low inference efficiency of Transformers by proposing an adaptive rank allocation framework together with the RaNA adapter. RaNA integrates low-rank decomposition and adaptive neuron masking into the linear modules of both MLP and attention layers, making it the first approach to unify low-rank adaptation and adaptive masking in rank space. It eliminates reliance on activation sparsity and on dedicated masker networks, enabling acceleration across full Transformer architectures, attention included. The method combines FLOPs-aware optimization, parameter-efficient fine-tuning (PEFT), and adaptive rank control for linear layers. Experiments show that, at roughly 44% FLOPs reduction, the approach improves perplexity by up to 7 points and accuracy by up to 8 percentage points, substantially outperforming existing neuron-adaptive methods.

📝 Abstract
Large Language Models (LLMs) are computationally intensive, particularly during inference. Neuron-adaptive techniques, which selectively activate neurons in Multi-Layer Perceptron (MLP) layers, offer some speedups but suffer from limitations in modern Transformers. These include reliance on sparse activations, incompatibility with attention layers, and the use of costly neuron masking techniques. To address these issues, we propose the Adaptive Rank Allocation framework and introduce the Rank and Neuron Allocator (RaNA) adapter. RaNA adapters leverage rank adapters, which operate on linear layers by applying both low-rank matrix decompositions and adaptive masking to efficiently allocate compute without depending on activation sparsity. This enables RaNA to be generally applied to MLPs and linear components of attention modules, while eliminating the need for expensive maskers found in neuron-adaptive methods. Notably, when compared to neuron adapters, RaNA improves perplexity by up to 7 points and increases accuracy by up to 8 percentage points while reducing FLOPs by ~44% in state-of-the-art Transformer architectures. These results position RaNA as a robust solution for improving inference efficiency in modern Transformer architectures.
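The core mechanism the abstract describes, replacing a dense linear layer with a low-rank factorization whose rank components are masked adaptively per input, can be sketched in a few lines. This is an illustrative reconstruction, not the paper's implementation: the layer sizes, the top-k magnitude heuristic for the mask, and all names (`rana_like_forward`, `A`, `B`) are assumptions introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, rank = 64, 64, 16   # hypothetical layer sizes

# Low-rank factors approximating a dense weight W ≈ B @ A
A = rng.normal(size=(rank, d_in)) / np.sqrt(d_in)
B = rng.normal(size=(d_out, rank)) / np.sqrt(rank)

def rana_like_forward(x, k):
    """Sketch: keep only the k largest-magnitude rank components per input."""
    z = A @ x                          # project into rank space: shape (rank,)
    keep = np.argsort(np.abs(z))[-k:]  # adaptive mask chosen in rank space
    return B[:, keep] @ z[keep]        # reconstruct using surviving ranks only

x = rng.normal(size=d_in)
full = B @ (A @ x)                  # unmasked low-rank output
approx = rana_like_forward(x, k=12) # masked output, fewer FLOPs in the B matmul
```

Because the mask lives in rank space rather than neuron space, the same construction applies to any linear layer, which is why it extends to attention projections and needs no separate masker network.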
Problem

Research questions and friction points this paper is trying to address.

Reducing computational intensity in Large Language Models (LLMs) during inference
Overcoming limitations of neuron-adaptive techniques in modern Transformers
Efficiently allocating compute without relying on activation sparsity
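To make the compute-allocation point above concrete, a back-of-the-envelope FLOPs count for one linear layer: a dense matmul costs about 2·d_in·d_out FLOPs, while a low-rank path with k active rank components costs roughly 2·k·(d_in + d_out). The dimension and k below are illustrative choices, picked only to land near the ~44% reduction quoted in the abstract, not values from the paper.

```python
# Illustrative FLOPs accounting for one linear layer (all sizes are assumptions).
d_in = d_out = 4096            # hypothetical hidden dimension
k = 1147                       # active rank components, chosen to illustrate ~44%

dense_flops = 2 * d_in * d_out           # y = W @ x
adapter_flops = 2 * k * (d_in + d_out)   # y = B[:, keep] @ (A[keep, :] @ x)

reduction = 1 - adapter_flops / dense_flops
print(f"FLOPs reduction: {reduction:.0%}")   # FLOPs reduction: 44%
```

The break-even condition is k < d_in·d_out / (d_in + d_out), i.e. roughly half the hidden dimension for square layers; an allocator can spend that rank budget unevenly across layers.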
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Rank Allocation for efficient compute
Rank and Neuron Allocator (RaNA) adapters
Low-rank decomposition and adaptive masking
Roberto Garcia
Institute of Computational and Mathematical Engineering, Stanford University
Jerry Liu
Institute of Computational and Mathematical Engineering, Stanford University
Daniel Sorvisto
Institute of Computational and Mathematical Engineering, Stanford University
Sabri Eyuboglu
PhD Student in Computer Science, Stanford University