How Small Can 6G Reason? Scaling Tiny Language Models for AI-Native Networks

📅 2026-03-02
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This study addresses the challenge of deploying large language models (LLMs) in AI-native 6G networks, where high computational demands hinder edge-device deployment. To balance semantic reasoning capability with resource efficiency, the authors systematically evaluate models ranging from 135M to 7B parameters across 30 standardized decision tasks using the 6G-Bench benchmark. They introduce the Edge Score, a normalized metric integrating accuracy, latency, and memory usage, and compare architectures including SmolLM2, Llama-3.2, and Qwen2.5. Results reveal that models of 1.5B–3B parameters achieve the optimal trade-off between deterministic reasoning stability and computational efficiency. Performance exhibits a non-monotonic trend with scale: stability improves markedly between 1B–1.5B, but gains diminish beyond 3B. While the 135M model achieves 0.224 accuracy and the 7B model reaches 0.707, medium-scale models deliver the highest semantic reliability per unit of edge resource.
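The summary describes the Edge Score as accuracy normalized by latency and memory usage but does not give a closed form. A minimal Python sketch under the assumption that the metric is simply accuracy divided by the product of latency and memory footprint; the function name, the formula, and the latency/memory figures are illustrative assumptions, and only the accuracies 0.224 (135M) and 0.707 (7B) come from the paper:

```python
def edge_score(accuracy: float, latency_s: float, memory_gb: float) -> float:
    """Hypothetical Edge-Score-style metric: accuracy per unit of
    edge resource. The paper's exact formula is not reproduced here;
    this ratio is an illustrative assumption."""
    return accuracy / (latency_s * memory_gb)

# Accuracies are from the paper; latency and memory figures below
# are invented purely to illustrate the trade-off.
tiny = edge_score(accuracy=0.224, latency_s=0.05, memory_gb=0.3)
large = edge_score(accuracy=0.707, latency_s=1.2, memory_gb=15.0)
print(tiny > large)  # True: the small model wins per unit edge resource
```

Under any metric of this shape, a model with far lower accuracy can still dominate once its much smaller latency and memory costs are factored in, which is the paper's core observation about mid-scale models.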

๐Ÿ“ Abstract
Emerging 6G visions, reflected in ongoing standardization efforts within 3GPP, IETF, ETSI, ITU-T, and the O-RAN Alliance, increasingly characterize networks as AI-native systems in which high-level semantic reasoning layers operate above standardized control and data-plane functions. Although frontier-scale large language models (LLMs) such as Qwen2.5-7B and Olmo-3-7B demonstrate strong reasoning capability, their computational footprint limits deployment in latency-sensitive, edge-native infrastructures. This paper presents a systematic empirical study of the scaling behavior and deployment efficiency of compact language models for network-level semantic reasoning in AI-native 6G systems. Using 6G-Bench, a standardization-aligned benchmark comprising 30 decision-making tasks across five capability domains, we evaluate models ranging from 135M (SmolLM2-135M) to 7B parameters (Qwen2.5-7B), including mid-scale architectures such as Llama-3.2-1B, Granite-1B, and Qwen2.5-3B. Deterministic accuracy (pass@1) increases from 0.224 at 135M to 0.707 at 7B, but scaling gains are highly non-uniform. A pronounced stability transition occurs in the 1 to 1.5B range, where accuracy rises from 0.373 (Llama-3.2-1B) to 0.531 (Qwen2.5-1.5B) and the instability gap Δ5 contracts from 0.356 to 0.138. Beyond 3B parameters, improvements diminish (+0.064 from 3B to 7B). Through single-query inference profiling and an Edge Score metric that normalizes accuracy by latency and memory footprint, we show that semantic reliability per unit edge resource does not scale monotonically with parameter count. Instead, mid-scale models (approximately 1.5 to 3B) achieve the most favorable balance between deterministic stability and computational efficiency, providing deployment-relevant guidance for AI-native 6G architectures. All scripts and results are publicly available at https://github.com/maferrag/6G-Bench
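The abstract does not define the instability gap Δ5. One plausible reading, given the subscript, is the max-min spread of accuracy across five repeated evaluation runs; a minimal sketch under that assumption, where the function name and the per-run accuracies are hypothetical and only the reported gaps (0.356 for Llama-3.2-1B, 0.138 for Qwen2.5-1.5B) come from the paper:

```python
def instability_gap(run_accuracies: list[float]) -> float:
    """Assumed reading of the Delta_5 metric: the max-min spread of
    accuracy across repeated evaluation runs (here, five)."""
    return max(run_accuracies) - min(run_accuracies)

# Hypothetical per-run accuracies chosen so the spreads match the
# gaps reported in the abstract; the individual values are invented.
unstable_1b = [0.200, 0.350, 0.410, 0.500, 0.556]   # spread 0.356
stable_1_5b = [0.450, 0.490, 0.520, 0.550, 0.588]   # spread 0.138
print(round(instability_gap(unstable_1b), 3))   # 0.356
print(round(instability_gap(stable_1_5b), 3))   # 0.138
```

A shrinking spread across runs is what the abstract calls the stability transition: above roughly 1.5B parameters, repeated queries converge on the same answer rather than oscillating.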
Problem

Research questions and friction points this paper is trying to address.

6G
AI-native networks
tiny language models
semantic reasoning
edge deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tiny Language Models
AI-Native 6G
Semantic Reasoning
Edge Efficiency
Model Scaling