🤖 AI Summary
To address the high latency, excessive memory consumption, and weak task capability of language models on edge devices, this paper introduces LFM2, a family of efficient on-device Liquid Foundation Models. Methodologically: (i) a hardware-in-the-loop architecture search under edge latency and memory constraints yields a lightweight hybrid backbone that combines gated short convolutions with a small number of grouped-query attention blocks; (ii) training pairs difficulty-ordered curriculum learning with a tempered, decoupled Top-K knowledge distillation objective, followed by a three-stage post-training recipe of supervised fine-tuning, length-normalized preference optimization, and model merging; (iii) the models ship with deployment packages for ExecuTorch, llama.cpp, and vLLM. Experiments show that the LFM2 family (350M–8.3B parameters) outperforms comparably sized lightweight models, with LFM2-2.6B reaching 79.56% on IFEval and 82.41% on GSM8K, delivers up to 2× faster CPU prefill and decode, and extends to variants for vision-language tasks, multilingual retrieval, and real-time speech interaction. Weights and deployment packages are publicly released.
📝 Abstract
We present LFM2, a family of Liquid Foundation Models designed for efficient on-device deployment and strong task capabilities. Using hardware-in-the-loop architecture search under edge latency and memory constraints, we obtain a compact hybrid backbone that combines gated short convolutions with a small number of grouped-query attention blocks, delivering up to 2× faster prefill and decode on CPUs compared to similarly sized models. The LFM2 family covers 350M–8.3B parameters, including dense models (350M, 700M, 1.2B, 2.6B) and a mixture-of-experts variant (8.3B total, 1.5B active), all with 32K context length. LFM2's training pipeline includes a tempered, decoupled Top-K knowledge distillation objective that avoids support mismatch; curriculum learning with difficulty-ordered data; and a three-stage post-training recipe of supervised fine-tuning, length-normalized preference optimization, and model merging. Pre-trained on 10-12T tokens, LFM2 models achieve strong results across diverse benchmarks; for example, LFM2-2.6B reaches 79.56% on IFEval and 82.41% on GSM8K. We further build multimodal and retrieval variants: LFM2-VL for vision-language tasks, LFM2-Audio for speech, and LFM2-ColBERT for retrieval. LFM2-VL supports tunable accuracy-latency tradeoffs via token-efficient visual processing, while LFM2-Audio separates audio input and output pathways to enable real-time speech-to-speech interaction competitive with models 3× larger. LFM2-ColBERT provides a low-latency encoder for queries and documents, enabling high-performance retrieval across multiple languages. All models are released with open weights and deployment packages for ExecuTorch, llama.cpp, and vLLM, making LFM2 a practical base for edge applications that need fast, memory-efficient inference and strong task capabilities.
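The abstract mentions gated short convolutions as the backbone's main sequence-mixing primitive. As a rough illustration only (the function and weight names below are assumptions, not the paper's parameterization), the idea can be sketched as gating the input, applying a short causal convolution along the sequence, and projecting the result:

```python
import numpy as np

def gated_short_conv(x, w_in, w_gate, conv_kernel, w_out):
    """Illustrative gated short-convolution mixing step (names and exact
    structure are assumptions, not LFM2's actual operator): sigmoid-gate
    the projected input, mix each channel over its last k timesteps with
    a short causal kernel, then apply an output projection."""
    T, d = x.shape
    k = conv_kernel.shape[0]
    b = x @ w_in                          # input projection
    g = 1.0 / (1.0 + np.exp(-(x @ w_gate)))  # sigmoid gate
    u = b * g                             # gated features, shape (T, d)
    # short causal conv: position t only sees positions t-k+1 .. t
    y = np.zeros_like(u)
    for t in range(T):
        for j in range(k):
            if t - j >= 0:
                y[t] += conv_kernel[j] * u[t - j]
    return y @ w_out
```

Because the kernel spans only a few timesteps, this operator runs in time linear in sequence length with a constant-size state, which is consistent with the abstract's claim of fast CPU prefill and decode.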
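The "tempered, decoupled Top-K knowledge distillation objective that avoids support mismatch" can be made concrete with a small sketch. This is one plausible form under stated assumptions (the paper's exact loss may differ): temperature-scale both distributions, restrict them to the teacher's top-K token indices, and renormalize both over that shared support so the KL term never compares probabilities outside it:

```python
import numpy as np

def softmax(z, T=1.0):
    """Numerically stable tempered softmax."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def topk_distill_loss(teacher_logits, student_logits, k=4, T=2.0):
    """Sketch of a tempered Top-K distillation loss (assumed form, not
    LFM2's exact objective): both distributions are renormalized over the
    teacher's top-k indices, so student and teacher share the same
    support before computing KL(teacher || student)."""
    t = softmax(teacher_logits, T)
    s = softmax(student_logits, T)
    idx = np.argsort(t)[-k:]            # teacher's top-k support
    t_k = t[idx] / t[idx].sum()         # renormalized teacher
    s_k = s[idx] / s[idx].sum()         # renormalized student
    return float(np.sum(t_k * (np.log(t_k) - np.log(s_k))))
```

Truncating to the teacher's top-K keeps only the tokens where the teacher's signal is informative; renormalizing both sides over that set is what prevents the support mismatch the abstract refers to.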
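Length-normalized preference optimization can likewise be sketched. The form below is an assumption in the spirit of SimPO-style objectives, not LFM2's published loss: the per-sequence log-likelihoods of the chosen and rejected responses are divided by their token counts before forming the preference margin, so longer responses are not automatically favored:

```python
import math

def length_normalized_pref_loss(logp_chosen, len_chosen,
                                logp_rejected, len_rejected, beta=0.1):
    """Assumed length-normalized preference loss (illustrative only):
    logistic loss on the scaled margin between average per-token
    log-likelihoods of the chosen and rejected responses."""
    margin = beta * (logp_chosen / len_chosen - logp_rejected / len_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Dividing by length removes the bias where a raw sequence-level log-likelihood margin systematically penalizes longer chosen responses.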