SmallThinker: A Family of Efficient Large Language Models Natively Trained for Local Deployment

📅 2025-07-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of efficiently deploying large language models (LLMs) on resource-constrained edge devices, which are characterized by limited computational power, small memory capacity, and slow storage, this paper proposes a deployment-aware paradigm of training LLMs natively for the edge. Methodologically, it introduces a two-level sparse architecture, pre-attention routing to enable compute-storage pipelining, and a hybrid NoPE-RoPE sparse attention mechanism, integrated with fine-grained Mixture-of-Experts (MoE), sparse feed-forward networks (FFNs), optimized KV caching, and Q4_0 quantization. The key contribution is the first demonstration of high-throughput LLM inference on commodity CPUs: the authors release SmallThinker-4B-A0.6B and SmallThinker-21B-A3B, which achieve >20 tokens/s on just 1 GB and 8 GB of RAM, respectively, setting a new state of the art for edge LLM deployment and substantially easing longstanding hardware bottlenecks.
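To make the compute-storage pipelining described above concrete, here is a minimal sketch of how a pre-attention router can overlap expert prefetching with attention. The class name, the `expert_store.load` call, and the thread-pool prefetch are illustrative assumptions standing in for the co-designed inference engine, not the released implementation.

```python
# Illustrative sketch of pre-attention routing (not the released implementation).
# Idea from the paper: the MoE router runs *before* attention, so the engine can
# start loading the selected experts' weights from slow storage while the CPU is
# still busy computing attention, hiding most of the storage latency.

import concurrent.futures
import numpy as np

class PreAttentionMoELayer:
    def __init__(self, router_weights, expert_store, top_k=4):
        self.router_weights = router_weights      # [hidden, n_experts]
        self.expert_store = expert_store          # hypothetical: expert id -> weights on disk
        self.top_k = top_k
        self.io_pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)

    def forward(self, hidden, attention_fn, expert_fn):
        # 1. Route on the *pre-attention* hidden state (hidden: [hidden_dim]).
        logits = hidden @ self.router_weights
        experts = np.argsort(logits)[-self.top_k:]

        # 2. Kick off asynchronous prefetch of the chosen experts from storage.
        prefetch = self.io_pool.submit(
            lambda: {e: self.expert_store.load(e) for e in experts})

        # 3. Compute attention on the CPU while the storage read is in flight.
        hidden = attention_fn(hidden)

        # 4. By the time attention finishes, expert weights are (mostly) resident.
        expert_weights = prefetch.result()
        return expert_fn(hidden, expert_weights, experts)
```

In practice the prefetch would target SSD/flash reads and an in-memory expert cache; the point is only that routing before attention exposes enough lead time to hide that I/O behind compute.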

📝 Abstract
While frontier large language models (LLMs) continue to push capability boundaries, their deployment remains confined to GPU-powered cloud infrastructure. We challenge this paradigm with SmallThinker, a family of LLMs natively designed, not adapted, for the unique constraints of local devices: weak computational power, limited memory, and slow storage. Unlike traditional approaches that mainly compress existing models built for clouds, we architect SmallThinker from the ground up to thrive within these limitations. Our innovation lies in a deployment-aware architecture that transforms constraints into design principles. First, we introduce a two-level sparse structure combining fine-grained Mixture-of-Experts (MoE) with sparse feed-forward networks, drastically reducing computational demands without sacrificing model capacity. Second, to conquer the I/O bottleneck of slow storage, we design a pre-attention router that enables our co-designed inference engine to prefetch expert parameters from storage while computing attention, effectively hiding storage latency that would otherwise cripple on-device inference. Third, for memory efficiency, we utilize a NoPE-RoPE hybrid sparse attention mechanism to slash KV cache requirements. We release SmallThinker-4B-A0.6B and SmallThinker-21B-A3B, which achieve state-of-the-art performance scores and even outperform larger LLMs. Remarkably, our co-designed system mostly eliminates the need for expensive GPU hardware: with Q4_0 quantization, both models exceed 20 tokens/s on ordinary consumer CPUs, while consuming only 1 GB and 8 GB of memory, respectively. SmallThinker is publicly available at hf.co/PowerInfer/SmallThinker-4BA0.6B-Instruct and hf.co/PowerInfer/SmallThinker-21BA3B-Instruct.
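As a rough illustration of the two-level sparse structure described in the abstract, the sketch below combines expert-level top-k routing with neuron-level sparsity inside each expert's FFN. The ReLU-induced sparsity, the softmax gating, and all dimensions are assumptions for illustration; the abstract does not specify SmallThinker's exact gating or activation functions.

```python
# Illustrative sketch of two-level sparsity: fine-grained MoE (level 1) whose
# experts are themselves sparse FFNs (level 2). Not SmallThinker's exact design.

import numpy as np

def two_level_sparse_ffn(x, router_w, experts, top_k=4):
    """x: [hidden], router_w: [hidden, n_experts],
    experts: list of dicts with 'w_in' [hidden, ffn] and 'w_out' [ffn, hidden]."""
    # Level 1: expert-level sparsity -- only top_k of n_experts are evaluated.
    scores = x @ router_w
    chosen = np.argsort(scores)[-top_k:]
    gates = np.exp(scores[chosen] - scores[chosen].max())
    gates /= gates.sum()

    out = np.zeros_like(x)
    for gate, e in zip(gates, chosen):
        w_in, w_out = experts[e]["w_in"], experts[e]["w_out"]
        # Level 2: neuron-level sparsity -- ReLU zeroes most intermediate
        # activations, so only the matching rows of w_out contribute.
        act = np.maximum(x @ w_in, 0.0)
        active = np.nonzero(act)[0]
        out += gate * (act[active] @ w_out[active])
    return out
```

The compute saving compounds: the router skips most experts entirely, and within each selected expert only the active neurons' rows of the output projection are touched.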
Problem

Research questions and friction points this paper is trying to address.

Develop efficient LLMs for local devices with weak hardware
Reduce computational demands without sacrificing model capacity
Optimize memory and storage for on-device inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-level sparse structure reduces computational demands
Pre-attention router hides storage latency effectively
NoPE-RoPE hybrid attention slashes KV cache requirements (see the sketch after this list)
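As a back-of-the-envelope illustration of why the hybrid helps, the sketch below sizes the KV cache when global-attention NoPE layers are interleaved with sliding-window RoPE layers. The layer ratio, window size, head counts, and precision are assumed values, not figures reported by the paper.

```python
# Back-of-the-envelope KV-cache sizing for a NoPE-RoPE hybrid attention stack.
# All dimensions below are illustrative assumptions.

def kv_cache_bytes(n_layers, seq_len, n_kv_heads, head_dim,
                   window=4096, nope_every=4, bytes_per_elem=2):
    total = 0
    for layer in range(n_layers):
        if layer % nope_every == 0:
            cached = seq_len               # NoPE layer: global attention, full cache
        else:
            cached = min(seq_len, window)  # RoPE layer: sliding window, bounded cache
        total += 2 * cached * n_kv_heads * head_dim * bytes_per_elem  # K and V
    return total

# Example: a 32k-token context with hypothetical model dimensions.
full = kv_cache_bytes(32, 32768, 8, 128, window=32768, nope_every=1)  # all-global baseline
hybrid = kv_cache_bytes(32, 32768, 8, 128)                            # hybrid pattern
print(f"baseline {full / 2**20:.0f} MiB vs hybrid {hybrid / 2**20:.0f} MiB")
```

Under these assumed settings, bounding most layers to a sliding window cuts the cache to roughly a third of the all-global baseline, which is the kind of saving that makes long contexts feasible in a small RAM budget.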
👥 Authors
Yixin Song
Shanghai Jiao Tong University
Zhenliang Xue
Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University
Dongliang Wei
Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University
Feiyang Chen
Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University
Jianxiang Gao
Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University
Junchen Liu
University of Texas Medical School, Houston, TX
cancer biology
Hangyu Liang
Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University
Guangshuo Qin
Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University
Chengrong Tian
Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University
Bo Wen
Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University
Longyu Zhao
Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University
Xinrui Zheng
Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University
Zeyu Mi
Associate Professor, Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University
LLM Systems, Operating System
Haibo Chen
Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University