DiBA: Diagonal and Binary Matrix Approximation for Neural Network Weight Compression

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the efficient compression of dense weight matrices in neural networks, such as those in linear layers, 1×1 convolutions, attention projections, and embedding layers. It proposes DiBA, which approximates a weight matrix by an alternating product of three diagonal and two 0/1 binary matrices, with an intermediate dimension k that trades theoretical storage against approximation accuracy. For matrix-vector products, this reduces the floating-point multiplication count from mn to m+k+n. Key contributions include the structured design combining diagonal scaling with binary mixing; DiBA-Greedy, an alternating solver based on closed-form least-squares updates for the diagonal factors and exact one-bit improvement tests for the binary factors; and DiBARD, a fine-tuning strategy that freezes the binary matrices and retunes only the diagonal entries on downstream data. On 40 dense weight matrices extracted from public pretrained models, DiBA-Greedy yields consistent SNR improvements as the theoretical storage ratio increases. In two component-replacement studies, DiBARD retuning after DiBA replacement improves DistilBERT/WikiText masked-token accuracy from 0.4447 to 0.5210 and Speech Commands test accuracy for an Audio Spectrogram Transformer from 0.7684 to 0.9781, without reoptimizing the binary factors.

📝 Abstract

In this paper, we propose DiBA (Diagonal and Binary Matrix Approximation), a compact matrix factorization for neural network weight compression. Many components of modern networks, including linear layers, $1\times1$ convolutions, attention projections, and embedding layers, have dense matrix weights. DiBA approximates $A\in\mathbb{R}^{m\times n}$ by $\widehat A=D_1B_1D_2B_2D_3$, where $D_1,D_2,D_3$ are diagonal matrices and $B_1,B_2$ are $0/1$ binary matrices. The intermediate dimension $k$ controls the trade-off between theoretical storage and approximation accuracy. For matrix-vector products, DiBA decomposes dense multiplication into three element-wise scaling operations and two binary mixing operations, reducing the floating-point multiplication count from $mn$ to $m+k+n$. For optimization, we introduce DiBA-Greedy, an alternating solver that combines closed-form least-squares updates for the diagonal factors with exact one-bit improvement tests for the binary factors. We also introduce DiBARD (DiBA with Retuning only Diagonal factors), which replaces dense-matrix layers by DiBA factors, freezes the binary matrices, and retunes only the diagonal entries on downstream data. This preserves compact binary mixing without discrete search during adaptation. On 40 dense weight matrices extracted from public pretrained models, DiBA-Greedy yields consistent SNR improvements as the theoretical storage ratio increases. After DiBA replacement in two component-replacement studies, DiBARD improves DistilBERT/WikiText masked-token accuracy from 0.4447 to 0.5210 and Speech Commands test accuracy for an Audio Spectrogram Transformer from 0.7684 to 0.9781 without reoptimizing the binary factors.
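
To make the multiplication count concrete, the following is a minimal sketch of a DiBA-style matrix-vector product in plain Python. It is not the authors' implementation. The shapes $B_1\in\{0,1\}^{m\times k}$ and $B_2\in\{0,1\}^{k\times n}$ are assumed from dimensional consistency with the intermediate dimension $k$, the function names `binary_mix`, `diba_matvec`, and `dense_from_factors` are hypothetical, and the factors are random placeholders rather than the output of DiBA-Greedy.

```python
import random

def binary_mix(B, v):
    """Multiply a 0/1 matrix by a vector using additions only."""
    return [sum(v_j for b, v_j in zip(row, v) if b) for row in B]

def diba_matvec(d1, B1, d2, B2, d3, x):
    """Evaluate D1 B1 D2 B2 D3 x from right to left.

    d1, d2, d3 hold the diagonal entries (lengths m, k, n).
    B1 is m x k and B2 is k x n (assumed shapes), with 0/1 entries.
    """
    t = [a * b for a, b in zip(d3, x)]     # n multiplications
    t = binary_mix(B2, t)                  # additions only, length k
    t = [a * b for a, b in zip(d2, t)]     # k multiplications
    t = binary_mix(B1, t)                  # additions only, length m
    return [a * b for a, b in zip(d1, t)]  # m multiplications

def dense_from_factors(d1, B1, d2, B2, d3):
    """Materialize the dense m x n product, for checking only."""
    m, k, n = len(d1), len(d2), len(d3)
    return [[d1[i] * d3[j] * sum(B1[i][l] * d2[l] * B2[l][j] for l in range(k))
             for j in range(n)] for i in range(m)]

# Random placeholder factors (not fitted to any real weight matrix).
random.seed(0)
m, k, n = 4, 3, 5
d1 = [random.uniform(-1, 1) for _ in range(m)]
d2 = [random.uniform(-1, 1) for _ in range(k)]
d3 = [random.uniform(-1, 1) for _ in range(n)]
B1 = [[random.randint(0, 1) for _ in range(k)] for _ in range(m)]
B2 = [[random.randint(0, 1) for _ in range(n)] for _ in range(k)]
x = [random.uniform(-1, 1) for _ in range(n)]

y_fast = diba_matvec(d1, B1, d2, B2, d3, x)
A_hat = dense_from_factors(d1, B1, d2, B2, d3)
y_dense = [sum(a * xj for a, xj in zip(row, x)) for row in A_hat]
max_gap = max(abs(u - v) for u, v in zip(y_fast, y_dense))

# Factored path: m + k + n multiplications; dense path: m * n.
print(max_gap, m + k + n, m * n)
```

The factored path never forms $\widehat A$: only the three diagonal scalings multiply floating-point values, while the binary factors select and add entries. The dense reconstruction is included solely to check that both paths agree up to rounding.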

Problem

Research questions and friction points this paper is trying to address.

weight compression

matrix approximation

neural networks

binary matrices

diagonal matrices

Innovation

Methods, ideas, or system contributions that make the work stand out.

matrix factorization

weight compression

binary matrices