TensorLLM: Tensorising Multi-Head Attention for Enhanced Reasoning and Compression in LLMs

📅 2025-01-26
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing LLM compression methods predominantly target the feed-forward networks (FFNs) and neglect the multi-head attention (MHA) modules, which leads to low compression ratios, high storage overhead, and marginal performance gains. This work introduces the first tensorisation-based approach to MHA weight compression: a structured denoising framework built on multi-head tensorisation and the Tucker decomposition, which enforces a shared higher-dimensional subspace across the attention heads. The method is training-free, requires no auxiliary data or fine-tuning, and enables zero-shot repurposing of the compressed weights. It is orthogonal to FFN-only denoising techniques and integrates seamlessly with them. Evaluated across diverse reasoning benchmarks, it achieves up to ~250× compression of the MHA weights while improving reasoning performance. The framework is architecture-agnostic, supporting both encoder-only and decoder-only models.
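To make the multi-head tensorisation step concrete, below is a minimal sketch of how the per-head query, key, value and output projection weights can be stacked into a single higher-order tensor. The dimensions, weight layout and stacking order are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

# Illustrative dimensions (assumed, not taken from the paper).
d_model, n_heads = 768, 12
d_head = d_model // n_heads  # 64

# Stand-in MHA weights: for each projection type (q, k, v, o),
# head h owns a (d_model, d_head) slice.
rng = np.random.default_rng(0)
W = {name: rng.standard_normal((d_model, n_heads, d_head))
     for name in ("q", "k", "v", "o")}

# Multi-head tensorisation: gather the four projection matrices of every
# head into one 4th-order tensor of shape (d_model, d_head, 4, n_heads),
# so that a single decomposition can act jointly across all heads.
mha_tensor = np.stack(
    [np.stack([W[name][:, h, :] for name in ("q", "k", "v", "o")], axis=-1)
     for h in range(n_heads)],
    axis=-1,
)
print(mha_tensor.shape)  # (768, 64, 4, 12)
```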

📝 Abstract
The reasoning abilities of Large Language Models (LLMs) can be improved by structurally denoising their weights, yet existing techniques primarily focus on denoising the feed-forward network (FFN) of the transformer block, and cannot efficiently utilise the Multi-head Attention (MHA) block, which is the core of transformer architectures. To address this issue, we propose a novel intuitive framework that, at its very core, performs MHA compression through a multi-head tensorisation process and the Tucker decomposition. This enables both higher-dimensional structured denoising and compression of the MHA weights, by enforcing a shared higher-dimensional subspace across the weights of the multiple attention heads. We demonstrate that this approach consistently enhances the reasoning capabilities of LLMs across multiple benchmark datasets, and for both encoder-only and decoder-only architectures, while achieving compression rates of up to $\sim 250$ times in the MHA weights, all without requiring any additional data, training, or fine-tuning. Furthermore, we show that the proposed method can be seamlessly combined with existing FFN-only-based denoising techniques to achieve further improvements in LLM reasoning performance.
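As a rough illustration of the Tucker-based denoising and compression described in the abstract, the sketch below uses the TensorLy library to decompose a tensorised MHA weight tensor and reconstruct it. The tensor shape and multilinear ranks are hypothetical and not the paper's settings; the key idea is that one small core tensor and a shared set of factor matrices represent all heads, so the reconstruction enforces a common subspace (structured denoising) while the factorised form is the compressed representation.

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import tucker

# Tensorised MHA weights, shape (d_model, d_head, 4 projections, n_heads);
# random stand-in values for illustration.
rng = np.random.default_rng(0)
mha_tensor = tl.tensor(rng.standard_normal((768, 64, 4, 12)))

# Hypothetical multilinear (Tucker) ranks, one per tensor mode; in practice
# these are tuned per model and set the denoising/compression trade-off.
ranks = [256, 32, 4, 12]

# Tucker decomposition: a small core tensor plus one factor matrix per mode.
# The factor matrices are shared across all attention heads, which enforces
# the common higher-dimensional subspace described in the abstract.
core, factors = tucker(mha_tensor, rank=ranks)

# Reconstructing from the low-rank factors yields structurally denoised
# weights; storing (core, factors) instead of the dense tensor is the
# compression.
denoised = tl.tucker_to_tensor((core, factors))

dense_params = mha_tensor.size
tucker_params = core.size + sum(f.size for f in factors)
print(f"compression ratio: {dense_params / tucker_params:.1f}x")
```

With the illustrative ranks above the ratio is only a few times; this same parameter-count arithmetic, under far more aggressive ranks, is what yields the up-to ~250× figure reported for the MHA weights.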
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Multi-head Attention
Optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

TensorLLM
Multi-head Attention Optimization
Efficient Data Processing
Yuxuan Gu
Harbin Institute of Technology
Text Generation
Wuyang Zhou
Department of Electrical and Electronic Engineering, Imperial College London, United Kingdom
Giorgos Iacovides
Department of Electrical and Electronic Engineering, Imperial College London, United Kingdom
Danilo Mandic
Prof. of Machine Intelligence, Dept of Electrical and Electronic Eng., Imperial College London, UK
Machine Intelligence and Statistical Signal Proc. · Biomedicine and Finance · Hearables and Ear-EEG · Deep RNNs · Tensors and Graphs