Benchmarking Small Language Models and Small Reasoning Language Models on System Log Severity Classification

πŸ“… 2026-01-12
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This study addresses the limitations of traditional log severity classification as a means of evaluating models' semantic understanding and reasoning capabilities. For the first time, the task is employed as a probe to systematically assess nine small language models and small reasoning language models (SLMs/SRLMs) under zero-shot, few-shot, and retrieval-augmented generation (RAG) settings, with a focus on log comprehension and deployability. Experiments on real-world Linux journalctl logs reveal the joint impact of model architecture, training objectives, and retrieval-integration mechanisms on performance. Qwen3-4B achieves 95.64% accuracy under RAG, while the compact Qwen3-0.6B attains 88.12%, demonstrating strong efficiency. Notably, certain SRLMs degrade sharply under RAG: Phi-4-Mini-Reasoning, for example, drops below 10% accuracy with inference latency exceeding 228 seconds per log.

πŸ“ Abstract
System logs are crucial for monitoring and diagnosing modern computing infrastructure, but their scale and complexity require reliable and efficient automated interpretation. Since severity levels are predefined metadata in system log messages, having a model merely classify them offers limited standalone practical value, revealing little about its underlying ability to interpret system logs. We argue that severity classification is more informative when treated as a benchmark for probing runtime log comprehension rather than as an end task. Using real-world journalctl data from Linux production servers, we evaluate nine small language models (SLMs) and small reasoning language models (SRLMs) under zero-shot, few-shot, and retrieval-augmented generation (RAG) prompting. The results reveal strong stratification. Qwen3-4B achieves the highest accuracy at 95.64% with RAG, while Gemma3-1B improves from 20.25% under few-shot prompting to 85.28% with RAG. Notably, the tiny Qwen3-0.6B reaches 88.12% accuracy despite weak performance without retrieval. In contrast, several SRLMs, including Qwen3-1.7B and DeepSeek-R1-Distill-Qwen-1.5B, degrade substantially when paired with RAG. Efficiency measurements further separate models: most Gemma and Llama variants complete inference in under 1.2 seconds per log, whereas Phi-4-Mini-Reasoning exceeds 228 seconds per log while achieving less than 10% accuracy. These findings suggest that (1) architectural design, (2) training objectives, and (3) the ability to integrate retrieved context under strict output constraints jointly determine performance. By emphasizing small, deployable models, this benchmark aligns with real-time requirements of digital twin (DT) systems and shows that severity classification serves as a lens for evaluating model competence and real-time deployability, with implications for root cause analysis (RCA) and broader DT integration.
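The RAG prompting setup the abstract describes, retrieving labeled example logs and constraining the model to emit exactly one journalctl severity level, can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the token-overlap retriever stands in for whatever retrieval mechanism the authors used, and names such as `retrieve_examples` and `build_rag_prompt` are hypothetical.

```python
# Hypothetical sketch of RAG prompting for log severity classification.
# The retriever is a toy token-overlap ranker; a real system would use
# embeddings. The severity labels are the standard syslog/journalctl set.

SEVERITIES = ["emerg", "alert", "crit", "err", "warning", "notice", "info", "debug"]

def retrieve_examples(query, corpus, k=3):
    """Return the k labeled logs sharing the most tokens with the query."""
    q_tokens = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda ex: len(q_tokens & set(ex["log"].lower().split())),
        reverse=True,
    )
    return ranked[:k]

def build_rag_prompt(log_line, corpus, k=3):
    """Compose a prompt with retrieved labeled examples and a strict
    output constraint (answer must be one severity token)."""
    shots = "\n".join(
        f"Log: {ex['log']}\nSeverity: {ex['label']}"
        for ex in retrieve_examples(log_line, corpus, k)
    )
    return (
        "Classify the severity of the final log line.\n"
        f"Answer with exactly one of: {', '.join(SEVERITIES)}.\n\n"
        f"{shots}\n\nLog: {log_line}\nSeverity:"
    )

# Tiny illustrative corpus of labeled journalctl-style lines.
corpus = [
    {"log": "kernel: Out of memory: Killed process 1234", "label": "err"},
    {"log": "sshd[871]: Accepted publickey for admin", "label": "info"},
    {"log": "systemd[1]: Started Daily apt upgrade timer.", "label": "info"},
]
prompt = build_rag_prompt("kernel: Out of memory: Killed process 9876", corpus, k=2)
print(prompt)
```

The prompt would then be sent to the SLM/SRLM under test; the abstract's finding that some reasoning models degrade under RAG suggests they struggle to honor the strict single-token output constraint once retrieved context is injected.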
Problem

Research questions and friction points this paper is trying to address.

system log severity classification
small language models
log comprehension
real-time deployability
benchmarking
Innovation

Methods, ideas, or system contributions that make the work stand out.

small language models
log severity classification
retrieval-augmented generation
digital twin
runtime log comprehension
πŸ‘₯ Authors
Yahya Masri
George Mason University
Emily Ma
George Mason University
Zifu Wang
Shanghai AI Laboratory
Joseph Rogers
George Mason University
Chaowei Yang
George Mason University