🤖 AI Summary
This study addresses the limited evaluation of open-source large language models (LLMs) on telecom-domain question answering. We introduce a benchmark of 105 technical question-answer pairs drawn from advanced wireless communications material, covering both factual and reasoning-based queries. To systematically assess correctness, consistency, and hallucination risk, we propose a multidimensional evaluation framework that integrates lexical metrics, semantic similarity, LLM-as-a-judge scoring, and source attribution analysis. An experimental comparison of Gemma 3 27B and DeepSeek R1 32B shows that Gemma 3 achieves higher semantic fidelity and LLM-rated correctness, whereas DeepSeek R1 exhibits slightly higher lexical consistency. These findings underscore the role of domain adaptation in building reliable engineering AI assistants. The benchmark and evaluation methodology provide a reproducible, empirically grounded foundation for assessing LLMs in specialized vertical domains, supporting both practical deployment and methodological rigor.
📝 Abstract
Large Language Models (LLMs) have shown remarkable capabilities across various fields. However, their performance in technical domains such as telecommunications remains underexplored. This paper evaluates two open-source LLMs, Gemma 3 27B and DeepSeek R1 32B, on factual and reasoning-based questions derived from advanced wireless communications material. We construct a benchmark of 105 question-answer pairs and assess performance using lexical metrics, semantic similarity, and LLM-as-a-judge scoring. We also analyze consistency, judgment reliability, and hallucination through source attribution and score variance. Results show that Gemma excels in semantic fidelity and LLM-rated correctness, while DeepSeek demonstrates slightly higher lexical consistency. Additional findings highlight current limitations in telecom applications and the need for domain-adapted models to support trustworthy Artificial Intelligence (AI) assistants in engineering.
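The lexical and semantic comparisons described above can be sketched in miniature. The snippet below is a minimal illustration, not the paper's actual pipeline: `token_f1` stands in for the lexical metrics, and a bag-of-words cosine stands in for embedding-based semantic similarity (a real evaluation would use dense sentence embeddings and an LLM judge). All function names and the example QA pair are hypothetical.

```python
from collections import Counter
import math

def token_f1(prediction: str, reference: str) -> float:
    # Lexical overlap: harmonic mean of token precision and recall,
    # a simple stand-in for lexical metrics such as ROUGE.
    pred, ref = prediction.lower().split(), reference.lower().split()
    common = sum((Counter(pred) & Counter(ref)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(ref)
    return 2 * precision * recall / (precision + recall)

def cosine_similarity(prediction: str, reference: str) -> float:
    # Bag-of-words cosine as a toy proxy for semantic similarity;
    # a real setup would compare dense sentence embeddings instead.
    a = Counter(prediction.lower().split())
    b = Counter(reference.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) \
         * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical reference answer and model answer for one benchmark item.
reference = "OFDM divides the channel into many orthogonal subcarriers."
answer = "OFDM splits the channel into orthogonal subcarriers."
print(round(token_f1(answer, reference), 3))
print(round(cosine_similarity(answer, reference), 3))
```

Scoring each of the 105 answers this way, and aggregating per model, yields the kind of lexical-versus-semantic comparison the abstract reports.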