Hamming Attention Distillation: Binarizing Keys and Queries for Efficient Long-Context Transformers

πŸ“… 2025-02-03
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
To address the excessive computational and memory overhead of deploying long-context Transformers, this paper proposes Hamming Attention Distillation (HAD), the first attention mechanism to incorporate Hamming distance: it binarizes keys and queries to {βˆ’1, +1}, replaces dot products with Hamming-distance-based similarity computation, and jointly applies attention matrix sparsification to preserve representational capacity under stringent binary constraints. HAD also enables hardware-software co-optimization for custom accelerators. Experiments show HAD incurs only a 1.78% accuracy drop on GLUE (a 7.3-percentage-point improvement over prior state-of-the-art binarization methods) and a 2.5% drop on ImageNet (a 9.64-point gain over SOTA). Hardware synthesis demonstrates a 79% reduction in area and 87% lower power consumption. The core innovation lies in the synergistic integration of Hamming-distance-driven attention distillation and high-fidelity binarization design.
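The binarized similarity above rests on a simple identity: for d-dimensional {βˆ’1, +1} vectors, the dot product equals d minus twice the Hamming distance, so attention scores can be computed with cheap XOR/popcount operations in hardware. A minimal NumPy sketch of that identity (the sign-based binarizer and the dimension here are illustrative assumptions, not the paper's exact training recipe):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # head dimension (illustrative)

# Binarize real-valued query/key vectors to {-1, +1} via sign
# (a common scheme; HAD's learned binarizer may differ).
q = np.sign(rng.standard_normal(d))
k = np.sign(rng.standard_normal(d))
q[q == 0] = 1
k[k == 0] = 1

# Hamming distance = number of positions where the signs disagree.
hamming = int(np.sum(q != k))

# For {-1, +1} vectors: q . k = (d - hamming) - hamming = d - 2 * hamming,
# so a Hamming-distance similarity is an affine rescaling of the dot product.
dot = int(q @ k)
assert dot == d - 2 * hamming
```

This is why replacing dot products with Hamming distance loses no information once keys and queries are binary: the two scores are related by a fixed affine map.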

πŸ“ Abstract
Pre-trained transformer models with extended context windows are notoriously expensive to run at scale, often limiting real-world deployment due to their high computational and memory requirements. In this paper, we introduce Hamming Attention Distillation (HAD), a novel framework that binarizes keys and queries in the attention mechanism to achieve significant efficiency gains. By converting keys and queries into {-1, +1} vectors and replacing dot-product operations with efficient Hamming distance computations, our method drastically reduces computational overhead. Additionally, we incorporate attention matrix sparsification to prune low-impact activations, which further reduces the cost of processing long-context sequences.

Despite these aggressive compression strategies, our distilled approach preserves a high degree of representational power, leading to substantially improved accuracy compared to prior transformer binarization methods. We evaluate HAD on a range of tasks and models, including the GLUE benchmark, ImageNet, and QuALITY, demonstrating state-of-the-art performance among binarized Transformers while drastically reducing the computational costs of long-context inference.

We implement HAD in custom hardware simulations, demonstrating superior performance characteristics compared to a custom hardware implementation of standard attention. HAD achieves just 1.78% performance loss on GLUE compared to 9.08% in state-of-the-art binarization work, and 2.5% performance loss on ImageNet compared to 12.14%, all while targeting custom hardware with a 79% area reduction and 87% power reduction compared to its standard attention counterpart.
Problem

Research questions and friction points this paper is trying to address.

High computational overhead of Transformer attention
Inefficient processing of long-context sequences
Accuracy degradation in binarized Transformers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Binarizes keys and queries
Uses Hamming distance computations
Incorporates attention matrix sparsification
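The sparsification contribution can be illustrated with a generic per-row top-k pruning of attention scores before the softmax, so that low-impact positions receive exactly zero weight. This is a sketch under that assumption; the function name and the top-k criterion are illustrative stand-ins, and HAD's actual pruning rule may differ:

```python
import numpy as np

def sparsify_topk(scores: np.ndarray, k: int) -> np.ndarray:
    """Keep the k largest scores in each row; mask the rest with -inf
    so the subsequent softmax assigns them zero probability."""
    out = np.full_like(scores, -np.inf)
    idx = np.argpartition(scores, -k, axis=-1)[..., -k:]  # indices of the k largest
    np.put_along_axis(out, idx, np.take_along_axis(scores, idx, axis=-1), axis=-1)
    return out

rng = np.random.default_rng(1)
scores = rng.standard_normal((2, 8))   # 2 queries, 8 key positions (illustrative)
pruned = sparsify_topk(scores, k=3)

# Softmax over the pruned rows: masked positions get weight exactly 0.
weights = np.exp(pruned - pruned.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
```

Because pruned positions never enter the value aggregation, the cost of the attention-times-values product shrinks with the sparsity level, which is where the long-context savings come from.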