Zebra-Llama: Towards Extremely Efficient Hybrid Models

📅 2025-05-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the low inference efficiency, high retraining cost, and unsustainability of LLM deployment, this paper proposes Zebra-Llama: a hybrid layer architecture that integrates State Space Models (SSMs) with Multi-head Latent Attention (MLA). Zebra-Llama enables ultra-low-cost knowledge distillation (requiring only 7–11B tokens) via a refined initialization, lightweight post-training distillation, and aggressive KV cache compression that reduces KV memory to 2–3.9% of the original model. It preserves ≥97% of average zero-shot performance while achieving Transformer-level accuracy and near-SSM inference efficiency. Zebra-Llama-8B improves few-shot accuracy by 7% over Minitron-8B, uses 8× fewer training tokens, cuts KV memory usage by over 12×, and achieves 2.6–3.8× higher throughput than MambaInLlama.
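To make the architecture described above concrete, here is a minimal, self-contained sketch of what such a hybrid stack could look like. Everything below is an illustrative assumption rather than the paper's actual configuration: the SSM block is a toy gated diagonal linear recurrence standing in for a Mamba-style layer, the MLA block only illustrates caching a small shared latent instead of full per-head K/V, and the module names, dimensions, and SSM-to-MLA ratio are made up.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleSSMBlock(nn.Module):
    """Toy stand-in for a Mamba/SSM layer: a gated diagonal linear recurrence."""

    def __init__(self, d_model: int):
        super().__init__()
        self.in_proj = nn.Linear(d_model, 2 * d_model)
        self.decay = nn.Parameter(torch.randn(d_model))      # learned per-channel state decay
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):                                    # x: (batch, seq, d_model)
        u, gate = self.in_proj(x).chunk(2, dim=-1)
        a = torch.sigmoid(self.decay)                        # keep the recurrence stable in (0, 1)
        h = torch.zeros_like(u[:, 0])
        outs = []
        for t in range(u.size(1)):                           # constant-size state, no KV cache
            h = a * h + (1 - a) * u[:, t]
            outs.append(h)
        y = torch.stack(outs, dim=1) * F.silu(gate)
        return self.out_proj(y)


class SimpleMLABlock(nn.Module):
    """Toy Multi-head Latent Attention: K/V are rebuilt from a small shared latent,
    so only the latent (not full per-head K/V) would be cached at inference."""

    def __init__(self, d_model: int, n_heads: int, d_latent: int):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.kv_down = nn.Linear(d_model, d_latent)          # compress: this is what gets cached
        self.kv_up = nn.Linear(d_latent, 2 * d_model)        # expand latent back into K and V
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        b, s, _ = x.shape
        q = self.q_proj(x).view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        latent = self.kv_down(x)                             # (batch, seq, d_latent)
        k, v = self.kv_up(latent).chunk(2, dim=-1)
        k = k.view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out_proj(y.transpose(1, 2).reshape(b, s, -1))


class HybridStack(nn.Module):
    """Mostly-SSM stack with an MLA layer every `mla_every` blocks (ratio is assumed)."""

    def __init__(self, d_model=512, n_layers=8, n_heads=8, d_latent=64, mla_every=4):
        super().__init__()
        self.layers = nn.ModuleList(
            [SimpleMLABlock(d_model, n_heads, d_latent) if (i + 1) % mla_every == 0
             else SimpleSSMBlock(d_model) for i in range(n_layers)]
        )
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(n_layers)])

    def forward(self, x):
        for norm, layer in zip(self.norms, self.layers):
            x = x + layer(norm(x))                           # pre-norm residual connections
        return x
```

Running `HybridStack()(torch.randn(2, 16, 512))` returns a tensor of the same shape; at inference time only the MLA layers would need a cache, and only of size `d_latent` per token, which is the intuition behind the KV reductions quoted above.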

📝 Abstract
With the growing demand for deploying large language models (LLMs) across diverse applications, improving their inference efficiency is crucial for sustainable and democratized access. However, retraining LLMs to meet new user-specific requirements is prohibitively expensive and environmentally unsustainable. In this work, we propose a practical and scalable alternative: composing efficient hybrid language models from existing pre-trained models. Our approach, Zebra-Llama, introduces a family of 1B, 3B, and 8B hybrid models by combining State Space Models (SSMs) and Multi-head Latent Attention (MLA) layers, using a refined initialization and post-training pipeline to efficiently transfer knowledge from pre-trained Transformers. Zebra-Llama achieves Transformer-level accuracy with near-SSM efficiency using only 7-11B training tokens (compared to trillions of tokens required for pre-training) and an 8B teacher. Moreover, Zebra-Llama dramatically reduces KV cache size, down to 3.9%, 2%, and 2.73% of the original for the 1B, 3B, and 8B variants, respectively, while preserving 100%, 100%, and >97% of average zero-shot performance on LM Harness tasks. Compared to models like MambaInLLaMA, X-EcoMLA, Minitron, and Llamba, Zebra-Llama consistently delivers competitive or superior accuracy while using significantly fewer tokens, smaller teachers, and vastly reduced KV cache memory. Notably, Zebra-Llama-8B surpasses Minitron-8B in few-shot accuracy by 7% while using 8x fewer training tokens, over 12x smaller KV cache, and a smaller teacher (8B vs. 15B). It also achieves 2.6x-3.8x higher throughput (tokens/s) than MambaInLlama up to a 32k context length. We will release code and model checkpoints upon acceptance.
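As a rough back-of-the-envelope illustration of where compression ratios in the 2-4% range can come from: converting most layers to SSMs removes their KV cache entirely, and the remaining attention layers cache only a small MLA latent. The layer counts, head sizes, and latent width below are assumed for illustration, not the paper's actual configurations.

```python
# Illustrative KV-cache arithmetic (all configuration numbers are assumptions).

def kv_bytes_per_token(n_layers, kv_heads, head_dim, bytes_per_val=2):
    """Full-attention baseline: K and V are cached for every layer."""
    return n_layers * 2 * kv_heads * head_dim * bytes_per_val

def hybrid_kv_bytes_per_token(n_attn_layers, d_latent, bytes_per_val=2):
    """Hybrid: only the MLA layers cache anything, and only a d_latent vector."""
    return n_attn_layers * d_latent * bytes_per_val

baseline = kv_bytes_per_token(n_layers=32, kv_heads=8, head_dim=128)   # e.g. an 8B-class model
hybrid = hybrid_kv_bytes_per_token(n_attn_layers=8, d_latent=256)      # assumed hybrid config

print(f"baseline: {baseline} B/token, hybrid: {hybrid} B/token, "
      f"ratio: {hybrid / baseline:.1%}")   # ~3.1% with these illustrative numbers
```

With these made-up numbers the hybrid caches about 3% of the baseline's bytes per token, the same ballpark as the 2-3.9% figures reported in the abstract.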
Problem

Research questions and friction points this paper is trying to address.

Improving inference efficiency of large language models
Reducing retraining costs for user-specific requirements
Minimizing KV cache size while maintaining accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid models combining SSMs and MLA layers
Efficient knowledge transfer from pre-trained Transformers (see the distillation sketch after this list)
Dramatically reduced KV cache size
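A rough sketch of what the lightweight post-training distillation step could look like in practice. This is an assumed, generic logit-distillation recipe, not the authors' published code; `teacher`, `student`, `batch`, and `optimizer` are placeholders. The idea is that the frozen pre-trained Transformer supplies soft targets that the hybrid SSM+MLA student learns to match within the 7-11B-token budget.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions, averaged per token."""
    vocab = student_logits.size(-1)
    s = F.log_softmax(student_logits / temperature, dim=-1).view(-1, vocab)
    t = F.softmax(teacher_logits / temperature, dim=-1).view(-1, vocab)
    # batchmean averages over token rows; the T^2 factor keeps gradient scale comparable across T
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

# Training-loop skeleton (placeholder objects, shown as comments):
# with torch.no_grad():
#     teacher_logits = teacher(batch["input_ids"]).logits   # frozen pre-trained Transformer
# student_logits = student(batch["input_ids"]).logits       # hybrid SSM + MLA model
# loss = distillation_loss(student_logits, teacher_logits)
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```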