🤖 AI Summary
To address the low computational efficiency of LLaMA-style architectures during training and inference—manifested as low throughput, low Model FLOPs Utilization (MFU), and slow convergence—while preserving strong learning capability, this paper proposes an inter-layer shared KV cache mechanism: a cross-layer KV cache reuse architecture that combines hierarchical parameter sharing with dynamic lightweight adaptation, significantly reducing KV computation complexity without compromising model expressivity. Experiments show that, when applied to LLaMA, the method improves training token throughput by up to 77%, increases MFU by up to 16%, and reduces final loss by up to 14% on an equal token budget; at inference, a 1.1B-parameter model achieves a roughly 7% throughput gain. To the authors' knowledge, this is the first work to enable efficient cross-layer reuse of KV caches, establishing a new paradigm for jointly optimizing training efficiency and model performance in large language models.
📝 Abstract
This paper introduces ECHO-LLaMA, an efficient LLaMA architecture designed to improve both training speed and inference throughput while maintaining learning capacity. ECHO-LLaMA adapts LLaMA models to share KV caches across certain layers, significantly reducing KV computational complexity while maintaining or improving language performance. Experimental results demonstrate that ECHO-LLaMA achieves up to 77% higher token-per-second throughput during training, up to 16% higher Model FLOPs Utilization (MFU), and up to 14% lower loss when trained on an equal number of tokens. Furthermore, on the 1.1B model, ECHO-LLaMA delivers approximately 7% higher test-time throughput compared to the baseline. By introducing a computationally efficient adaptation mechanism, ECHO-LLaMA offers a scalable and cost-effective solution for pretraining and finetuning large language models, enabling faster and more resource-efficient training without compromising performance.
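The abstract describes sharing KV caches across certain layers so that only some layers pay the cost of computing K and V projections. The minimal NumPy sketch below illustrates that general idea under stated assumptions: it is not the paper's implementation, and the class name `SharedKVAttention`, the producer/consumer grouping, and all parameter names are hypothetical. One "producer" layer in a group projects K and V and writes them to a cache; subsequent "consumer" layers compute only their queries and reuse the cached K/V.

```python
# Hedged sketch of cross-layer KV sharing (assumed mechanism, not the
# paper's actual code). Single-head attention, no masking, for clarity.
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class SharedKVAttention:
    """Self-attention layer that either computes its own K/V projections
    (owns_kv=True) or reuses K/V cached by an earlier layer in its group."""

    def __init__(self, d_model, rng, owns_kv=True):
        self.owns_kv = owns_kv
        self.Wq = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
        if owns_kv:  # consumer layers carry no K/V weights at all
            self.Wk = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
            self.Wv = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)

    def __call__(self, x, kv_cache):
        q = x @ self.Wq
        if self.owns_kv:  # producer layer: fill the shared cache
            kv_cache["k"], kv_cache["v"] = x @ self.Wk, x @ self.Wv
        k, v = kv_cache["k"], kv_cache["v"]  # consumers reuse cached K/V
        scores = softmax(q @ k.T / np.sqrt(q.shape[-1]))
        return scores @ v


# A group of 3 layers: only the first computes K/V; the next two reuse them,
# saving two K and two V projections per group.
rng = np.random.default_rng(0)
layers = [SharedKVAttention(16, rng, owns_kv=(i == 0)) for i in range(3)]
x = rng.standard_normal((4, 16))  # (seq_len, d_model)
cache = {}
for layer in layers:
    x = layer(x, cache)
print(x.shape)  # (4, 16)
```

In a group of g layers this removes g−1 pairs of K/V projections and keeps a single cached K/V tensor per group at inference time, which is consistent with the throughput and MFU gains the abstract reports; how ECHO-LLaMA chooses which layers share, and what its "adaptation mechanism" does on top of this, is beyond what the sketch shows.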