🤖 AI Summary
This work addresses the limitations of existing methods for converting pretrained attention modules to multi-head latent attention (MLA): they neglect the covariance structure of input activations and apply a uniform rank across layers, causing activation drift and degraded attention fidelity. To overcome this, the authors propose CARE, a Covariance-Aware, Rank-Enhanced conversion pipeline that, under a fixed KV-cache width, combines covariance-aware low-rank factorization, layer-adaptive rank allocation, and KV-aligned reparameterization to achieve both efficiency and high fidelity. Notably, CARE is the first to pair an activation-covariance-aware mechanism with non-uniform rank configuration and lightweight fine-tuning. Evaluated on Qwen3 and Llama-3.1 models, CARE reduces one-shot perplexity by up to 215× and improves mean accuracy by up to 1.70× over SVD baselines, fully recovering the original model's performance with only a brief fine-tune.
📝 Abstract
Converting pretrained attention modules such as grouped-query attention (GQA) into multi-head latent attention (MLA) can improve expressivity without increasing KV-cache cost, making it attractive for efficient inference. However, many practical conversion baselines rely on weight-only low-rank approximations (e.g., SVD-style initializations) and uniform rank allocation. They minimize the difference between weight matrices rather than accounting for how those weights act on input activations, ignore the covariance structure of the activations, and enforce a uniform rank across layers, causing activation drift and degraded attention fidelity. To address these issues, we propose CARE, a Covariance-Aware, Rank-Enhanced MLA conversion pipeline under a fixed KV width. CARE introduces three key steps: (i) activation-preserving factorization, which aligns the approximation with the actual input activations rather than just the weights; (ii) adjusted-rank allocation, which spreads a fixed KV budget across layers by giving more capacity to layers that need it most; and (iii) KV-parity mapping, which reparameterizes the converted K and V to fit the MLA format while keeping the KV-cache size unchanged. Our method outperforms a uniform-rank SVD baseline on Qwen3-4B/30B-A3B-Instruct-2507 and Llama-3.1-8B/70B-Instruct, reducing one-shot perplexity by up to 215× and improving mean accuracy by up to 1.70× at matched KV budgets. With a brief post-SVD healing fine-tune, we fully recover the original model's accuracy.
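To make the contrast with weight-only SVD concrete, the sketch below illustrates the general idea behind an activation-preserving (covariance-aware) factorization: instead of truncating the SVD of the weight matrix `W` itself, one minimizes the reconstruction error on the activations, `||X Wᵀ − X (BA)ᵀ||_F`, which reduces to a truncated SVD of `W C^{1/2}` where `C` is the empirical activation covariance. This is a minimal NumPy sketch of that standard construction, not the authors' exact CARE pipeline; the function name, shapes, and the `eps` regularizer are illustrative choices.

```python
import numpy as np

def covariance_aware_lowrank(W, X, rank):
    """Factor W (d_out, d_in) as B @ A (rank r) so that activations
    X @ W.T are preserved, rather than W itself.

    Minimizes ||X W^T - X (B A)^T||_F, which equals
    ||(W - B A) C^{1/2}||_F (up to a constant) for C = X^T X / n,
    so the optimum is a truncated SVD of W @ C^{1/2}.
    """
    n, d_in = X.shape
    C = X.T @ X / n
    # Small ridge so C^{1/2} is invertible even for rank-deficient samples
    eps = 1e-6 * np.trace(C) / d_in
    evals, evecs = np.linalg.eigh(C + eps * np.eye(d_in))
    sqrt_e = np.sqrt(evals)
    S = (evecs * sqrt_e) @ evecs.T          # C^{1/2}
    S_inv = (evecs / sqrt_e) @ evecs.T      # C^{-1/2}

    # Truncated SVD in the whitened (activation-weighted) space
    U, sing, Vt = np.linalg.svd(W @ S, full_matrices=False)
    B = U[:, :rank] * sing[:rank]           # (d_out, r)
    A = Vt[:rank] @ S_inv                   # (r, d_in), map back
    return A, B

# Demo: with anisotropic activations, the covariance-aware factorization
# gives lower activation error than plain weight-only SVD at equal rank.
rng = np.random.default_rng(0)
d_in, d_out, n, r = 64, 48, 512, 16
X = rng.normal(size=(n, d_in)) * np.linspace(3.0, 0.1, d_in)  # skewed spectrum
W = rng.normal(size=(d_out, d_in))

A, B = covariance_aware_lowrank(W, X, r)
err_care = np.linalg.norm(X @ W.T - X @ (B @ A).T)

U, s, Vt = np.linalg.svd(W, full_matrices=False)
W_svd = (U[:, :r] * s[:r]) @ Vt[:r]                           # weight-only baseline
err_svd = np.linalg.norm(X @ W.T - X @ W_svd.T)
```

At matched rank, `err_care` should come out below `err_svd` whenever the activation covariance is far from isotropic, which is exactly the regime the abstract argues weight-only baselines handle poorly.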