Abstract
We introduce Llamba, a family of efficient recurrent language models distilled from Llama-3.x into the Mamba architecture. The series includes Llamba-1B, Llamba-3B, and Llamba-8B, which achieve higher inference throughput and handle significantly larger batch sizes than Transformer-based models while maintaining comparable benchmark performance. Furthermore, Llamba demonstrates the effectiveness of cross-architecture distillation using MOHAWK (Bick et al., 2024), achieving these results with less than 0.1% of the training data typically used for models of similar size. To take full advantage of their efficiency, we provide an optimized implementation of Llamba for resource-constrained devices such as smartphones and edge platforms, offering a practical and memory-efficient alternative to Transformers. Overall, Llamba improves the tradeoff between speed, memory efficiency, and performance, making high-quality language models more accessible.
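The throughput and batch-size advantages claimed above come from the recurrent nature of the Mamba architecture: decoding carries only a fixed-size hidden state per layer, whereas a Transformer's KV cache grows linearly with the number of generated tokens. The toy single-channel linear state-space recurrence below is a minimal sketch of this property; it is purely illustrative (the dimensions, parameters, and `ssm_step` function are hypothetical and are not the actual Llamba/Mamba kernels, which use selective, input-dependent parameters and fused GPU implementations).

```python
import numpy as np

# Toy linear state-space (SSM) decode loop -- illustrative only, not the
# Llamba implementation. The key property: the state h has a fixed shape,
# so per-token compute and memory are O(N) regardless of how many tokens
# have been generated, unlike a Transformer KV cache that grows with T.

N = 16                                                 # state dimension
rng = np.random.default_rng(0)
Abar = np.exp(-np.exp(rng.standard_normal(N)) * 0.1)   # stable decays in (0, 1)
Bbar = rng.standard_normal(N) * 0.1                    # input projection
C = rng.standard_normal(N)                             # output projection

def ssm_step(h, x):
    """One decode step: h_t = Abar * h_{t-1} + Bbar * x_t,  y_t = C . h_t."""
    h = Abar * h + Bbar * x
    return h, float(C @ h)

h = np.zeros(N)
outputs = []
for x in np.sin(np.arange(100)):    # decode 100 "tokens"
    h, y = ssm_step(h, x)
    outputs.append(y)

print(h.shape)                      # state is still (16,) after 100 tokens
```

Because the state never grows, many such decode loops can run in parallel within a fixed memory budget, which is why recurrent models sustain much larger batch sizes than Transformers on the same hardware.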