AI Summary
To address the challenge of efficiently compressing KV caches in pretrained large language models using multi-head latent attention (MLA) without pre-training from scratch, this paper proposes X-EcoMLA, a post-training distillation framework that adapts existing models to MLA. Methodologically, it integrates knowledge distillation, low-rank joint key-value compression, hybrid attention architecture design, and lightweight parameter adaptation to upgrade standard attention into either hybrid or fully MLA-based variants. Evaluated on Llama3.2-1B-Inst with an 8B teacher model, X-EcoMLA achieves a 6.4× KV cache compression ratio while preserving 100% of the baseline's average score across LM Harness tasks. Training requires only 3.6B tokens and about 70 GPU hours (on AMD MI300), roughly 99.98% fewer GPU hours than the 370K used for the original pre-training. Key contributions include: (i) the first MLA migration framework that avoids pre-training from scratch; (ii) no loss in average benchmark score under 6.4× KV cache compression; and (iii) substantially lowered deployment barriers for MLA-enhanced inference.
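The 99.98% figure above can be checked with a line of arithmetic. The sketch below assumes the two GPU-hour numbers quoted on this page (about 70 for distillation, 370K for pre-training Llama3.2-1B) and treats GPU hours on different hardware as directly comparable, which is a simplification. The variable names are illustrative.

```python
# Figures quoted on this page; the names are illustrative, not from the paper.
distill_gpu_hours = 70          # approximate X-EcoMLA training cost (AMD MI300)
pretrain_gpu_hours = 370_000    # reported cost of pre-training Llama3.2-1B

# Fraction of GPU hours saved relative to pre-training from scratch.
# Assumption: GPU hours are compared one-to-one across hardware types.
reduction = 1 - distill_gpu_hours / pretrain_gpu_hours

print(f"{reduction:.2%}")  # prints 99.98%
```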
Abstract
Multi-head latent attention (MLA) is designed to optimize KV cache memory through low-rank key-value joint compression. Rather than caching keys and values separately, MLA stores their compressed latent representations, reducing memory overhead while maintaining performance. Although MLA improves memory efficiency without compromising language model accuracy, its major limitation is that it must be integrated during the pre-training phase, which requires training models from scratch. This raises a key question: can we obtain MLA's benefits, fully or partially, in models that have already been pre-trained with different attention mechanisms? In this paper, we propose X-EcoMLA, which uses post-training distillation to upcycle Transformer-based attention into an efficient hybrid (i.e., a combination of regular attention and MLA layers) or full MLA variant through lightweight post-training adaptation, bypassing the need for extensive pre-training. We demonstrate that leveraging the dark knowledge of a well-trained model can enhance training accuracy and enable extreme KV cache compression in MLA without compromising model performance. Our results show that using an 8B teacher model allows us to compress the KV cache size of the Llama3.2-1B-Inst baseline by 6.4x while preserving 100% of its average score across multiple tasks on the LM Harness Evaluation benchmark. This is achieved with only 3.6B training tokens and about 70 GPU hours on AMD MI300 GPUs, compared to the 370K GPU hours required for pre-training the Llama3.2-1B model.
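To make the idea of low-rank joint key-value compression concrete, here is a minimal NumPy sketch for a single layer. It is an illustration under stated assumptions, not the paper's implementation: the function names, the random weights, and the toy dimensions are all hypothetical, and the sizes were picked only so that the resulting ratio happens to equal 6.4. It also omits any separate positional key component that practical MLA designs may cache.

```python
import numpy as np

def compress_kv(hidden, w_down):
    """Project hidden states into one shared latent; this is what gets cached."""
    return hidden @ w_down

def expand_kv(latent, w_up_k, w_up_v, n_heads, head_dim):
    """Reconstruct per-head keys and values from the cached latent."""
    seq_len = latent.shape[0]
    keys = (latent @ w_up_k).reshape(seq_len, n_heads, head_dim)
    values = (latent @ w_up_v).reshape(seq_len, n_heads, head_dim)
    return keys, values

# Hypothetical toy sizes (not the paper's configuration).
d_model, n_heads, head_dim, latent_dim, seq_len = 64, 4, 16, 20, 10

rng = np.random.default_rng(0)
w_down = rng.standard_normal((d_model, latent_dim)) / np.sqrt(d_model)
w_up_k = rng.standard_normal((latent_dim, n_heads * head_dim)) / np.sqrt(latent_dim)
w_up_v = rng.standard_normal((latent_dim, n_heads * head_dim)) / np.sqrt(latent_dim)
hidden = rng.standard_normal((seq_len, d_model))

# Cache only the latent; keys and values are recomputed from it when needed.
latent_cache = compress_kv(hidden, w_down)
keys, values = expand_kv(latent_cache, w_up_k, w_up_v, n_heads, head_dim)

# Entries cached by standard attention (keys + values) vs. the latent cache.
standard_cache_size = 2 * seq_len * n_heads * head_dim
latent_cache_size = latent_cache.size
compression_ratio = standard_cache_size / latent_cache_size

print(f"standard: {standard_cache_size}, latent: {latent_cache_size}, "
      f"ratio: {compression_ratio:.1f}x")
```

In this toy setting, standard attention would cache 2 × 4 × 16 = 128 numbers per token, while the latent cache stores 20, so the ratio depends only on the latent dimension relative to the combined key and value width. Whether such a compressed model keeps its accuracy is exactly what the distillation step is meant to address.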