🤖 AI Summary
Small language models (SLMs) remain underexplored academically, with no systematic evaluation framework addressing their architectural diversity, training data, optimization strategies, and on-device performance. Method: This work introduces the first multidimensional benchmarking suite covering model architecture, pretraining data, training methodology, and edge-device efficiency, empirically evaluating 70 open-source SLMs (100M–5B parameters) across commonsense reasoning, multitask knowledge (MMLU), mathematical reasoning (GSM8K), code generation (HumanEval), in-context learning, and real-world edge-deployment metrics, including latency and memory footprint. Contribution/Results: We uncover nonlinear trade-offs between capability and efficiency; identify SLMs approaching large language model (LLM) performance on specific tasks; reveal that parameter count and data quality exhibit diminishing returns beyond certain thresholds; and release the first reproducible, standardized benchmark dataset for on-device SLM inference, enabling rigorous, comparable research toward democratized edge AI.
📝 Abstract
Small language models (SLMs), despite their widespread adoption in modern smart devices, have received significantly less academic attention compared to their large language model (LLM) counterparts, which are predominantly deployed in data centers and cloud environments. While researchers continue to improve the capabilities of LLMs in the pursuit of artificial general intelligence, SLM research aims to make machine intelligence more accessible, affordable, and efficient for everyday tasks. Focusing on transformer-based, decoder-only language models with 100M–5B parameters, we survey 70 state-of-the-art open-source SLMs, analyzing their technical innovations across three axes: architectures, training datasets, and training algorithms. In addition, we evaluate their capabilities in various domains, including commonsense reasoning, in-context learning, mathematics, and coding. To gain further insight into their on-device runtime costs, we benchmark their inference latency and memory footprints. Through in-depth analysis of our benchmarking data, we offer valuable insights to advance research in this field.
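The abstract mentions benchmarking on-device inference latency and memory footprint. As a minimal sketch of what such a measurement harness might look like, the snippet below times token generation and records peak allocation with Python's standard library. The `fake_slm_decode` function is a hypothetical placeholder, not the paper's actual pipeline; a real run would swap in an SLM forward pass (e.g., via a model loaded on the target edge device) and would separate prefill from decode latency.

```python
import time
import tracemalloc

def fake_slm_decode(prompt_tokens, max_new_tokens=32):
    # Placeholder for a real decoder-only SLM generation loop (assumption:
    # in practice this would be a model.generate() call on-device).
    out = list(prompt_tokens)
    for _ in range(max_new_tokens):
        out.append(sum(out[-4:]) % 50257)  # dummy "next token"
    return out

def benchmark(prompt_tokens, max_new_tokens=32):
    # Measure wall-clock latency and peak Python-heap allocation
    # for one generation call.
    tracemalloc.start()
    t0 = time.perf_counter()
    tokens = fake_slm_decode(prompt_tokens, max_new_tokens)
    latency = time.perf_counter() - t0
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    new_tokens = len(tokens) - len(prompt_tokens)
    return {
        "total_latency_s": latency,
        "per_token_latency_s": latency / new_tokens,
        "peak_mem_bytes": peak_bytes,
    }

stats = benchmark([1, 2, 3, 4], max_new_tokens=16)
```

Note that `tracemalloc` only sees Python-level allocations; for real SLM weights and KV-cache memory one would instead query the runtime (e.g., device-level memory counters), which is the kind of footprint the survey's edge benchmarks report.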