TensorLLM: Tensorising Multi-Head Attention for Enhanced Reasoning and Compression in LLMs

📅 2025-01-26
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing LLM compression methods predominantly target the feed-forward networks (FFNs) and neglect the multi-head attention (MHA) modules, which leads to low compression ratios, high storage overhead, and marginal performance gains. This work introduces the first tensorisation-based approach to MHA weight compression: a structured denoising framework built on multi-head tensorisation and the Tucker decomposition, which enforces a shared higher-dimensional subspace across the attention heads. The method is training-free, requires no auxiliary data or fine-tuning, and enables zero-shot repurposing of the compressed weights. It is orthogonal to FFN-only denoising techniques and integrates seamlessly with them. Evaluated across diverse reasoning benchmarks, it achieves up to ~250× compression of the MHA weights while improving reasoning performance. The framework is architecture-agnostic, supporting both encoder-only and decoder-only models.
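To make the multi-head tensorisation step concrete, below is a minimal sketch of how the per-head query, key, value and output projection weights can be stacked into a single higher-order tensor. The dimensions, weight layout and stacking order are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

# Illustrative dimensions (assumed, not taken from the paper).
d_model, n_heads = 768, 12
d_head = d_model // n_heads  # 64

# Stand-in MHA weights: for each projection type (q, k, v, o),
# head h owns a (d_model, d_head) slice.
rng = np.random.default_rng(0)
W = {name: rng.standard_normal((d_model, n_heads, d_head))
     for name in ("q", "k", "v", "o")}

# Multi-head tensorisation: gather the four projection matrices of every
# head into one 4th-order tensor of shape (d_model, d_head, 4, n_heads),
# so that a single decomposition can act jointly across all heads.
mha_tensor = np.stack(
    [np.stack([W[name][:, h, :] for name in ("q", "k", "v", "o")], axis=-1)
     for h in range(n_heads)],
    axis=-1,
)
print(mha_tensor.shape)  # (768, 64, 4, 12)
```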

📝 Abstract
The reasoning abilities of Large Language Models (LLMs) can be improved by structurally denoising their weights, yet existing techniques primarily focus on denoising the feed-forward network (FFN) of the transformer block, and cannot efficiently utilise the Multi-head Attention (MHA) block, which is the core of transformer architectures. To address this issue, we propose a novel intuitive framework that, at its very core, performs MHA compression through a multi-head tensorisation process and the Tucker decomposition. This enables both higher-dimensional structured denoising and compression of the MHA weights, by enforcing a shared higher-dimensional subspace across the weights of the multiple attention heads. We demonstrate that this approach consistently enhances the reasoning capabilities of LLMs across multiple benchmark datasets, and for both encoder-only and decoder-only architectures, while achieving compression rates of up to $\sim 250$ times in the MHA weights, all without requiring any additional data, training, or fine-tuning. Furthermore, we show that the proposed method can be seamlessly combined with existing FFN-only-based denoising techniques to achieve further improvements in LLM reasoning performance.
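As a rough illustration of the Tucker-based denoising and compression described in the abstract, the sketch below uses the TensorLy library to decompose a tensorised MHA weight tensor and reconstruct it. The tensor shape and multilinear ranks are hypothetical and not the paper's settings; the key idea is that one small core tensor and a shared set of factor matrices represent all heads, so the reconstruction enforces a common subspace (structured denoising) while the factorised form is the compressed representation.

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import tucker

# Tensorised MHA weights, shape (d_model, d_head, 4 projections, n_heads);
# random stand-in values for illustration.
rng = np.random.default_rng(0)
mha_tensor = tl.tensor(rng.standard_normal((768, 64, 4, 12)))

# Hypothetical multilinear (Tucker) ranks, one per tensor mode; in practice
# these are tuned per model and set the denoising/compression trade-off.
ranks = [256, 32, 4, 12]

# Tucker decomposition: a small core tensor plus one factor matrix per mode.
# The factor matrices are shared across all attention heads, which enforces
# the common higher-dimensional subspace described in the abstract.
core, factors = tucker(mha_tensor, rank=ranks)

# Reconstructing from the low-rank factors yields structurally denoised
# weights; storing (core, factors) instead of the dense tensor is the
# compression.
denoised = tl.tucker_to_tensor((core, factors))

dense_params = mha_tensor.size
tucker_params = core.size + sum(f.size for f in factors)
print(f"compression ratio: {dense_params / tucker_params:.1f}x")
```

With the illustrative ranks above the ratio is only a few times; this same parameter-count arithmetic, under far more aggressive ranks, is what yields the up-to ~250× figure reported for the MHA weights.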
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Multi-head Attention
Optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

TensorLLM
Multi-head Attention Optimization
Efficient Data Processing
Yuxuan Gu
Harbin Institute of Technology
Text Generation
Wuyang Zhou
Department of Electrical and Electronic Engineering, Imperial College London, United Kingdom
Giorgos Iacovides
Department of Electrical and Electronic Engineering, Imperial College London, United Kingdom
Danilo Mandic
Prof. of Machine Intelligence, Dept of Electrical and Electronic Eng., Imperial College London, UK
Machine Intelligence and Statistical Signal Proc. · Biomedicine and Finance · Hearables and Ear-EEG · Deep RNNs · Tensors and Graphs