Evaluating Open-Source Large Language Models for Technical Telecom Question Answering

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the insufficient evaluation of open-source large language models (LLMs) on telecom-domain question answering. We introduce the first benchmark comprising 105 technical question-answer pairs targeting advanced wireless communications, covering both factual and reasoning-based queries. To systematically assess correctness, consistency, and hallucination risk, we propose a multidimensional evaluation framework integrating semantic similarity metrics, LLM-as-a-judge scoring, and source attribution analysis. An experimental comparison between Gemma-3 27B and DeepSeek-R1 32B reveals that Gemma-3 achieves superior semantic fidelity and answer correctness, whereas DeepSeek-R1 exhibits marginally higher lexical matching performance. Our findings underscore the critical role of domain adaptation in enhancing the reliability of engineering AI assistants. The benchmark and evaluation methodology provide a reproducible, empirically grounded foundation for vertical-domain LLM assessment, advancing both practical deployment and methodological rigor in specialized AI evaluation.

📝 Abstract
Large Language Models (LLMs) have shown remarkable capabilities across various fields. However, their performance in technical domains such as telecommunications remains underexplored. This paper evaluates two open-source LLMs, Gemma 3 27B and DeepSeek R1 32B, on factual and reasoning-based questions derived from advanced wireless communications material. We construct a benchmark of 105 question-answer pairs and assess performance using lexical metrics, semantic similarity, and LLM-as-a-judge scoring. We also analyze consistency, judgment reliability, and hallucination through source attribution and score variance. Results show that Gemma excels in semantic fidelity and LLM-rated correctness, while DeepSeek demonstrates slightly higher lexical consistency. Additional findings highlight current limitations in telecom applications and the need for domain-adapted models to support trustworthy Artificial Intelligence (AI) assistants in engineering.
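The abstract's "lexical metrics" can be illustrated with a small sketch. The paper does not specify which lexical metric it uses, so token-level F1 (a common choice for QA benchmarks) is assumed here purely for illustration:

```python
import re
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a model answer and a reference answer.
    An illustrative stand-in for the paper's (unspecified) lexical metric."""
    tokenize = lambda s: re.findall(r"\w+", s.lower())
    pred, ref = Counter(tokenize(prediction)), Counter(tokenize(reference))
    overlap = sum((pred & ref).values())  # multiset intersection of tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Semantic similarity, by contrast, would typically be computed as cosine similarity between sentence embeddings of the two answers, which rewards paraphrases that this lexical score penalizes.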
Problem

Research questions and friction points this paper is trying to address.

Evaluating open-source LLMs for technical telecom question answering
Assessing performance on factual and reasoning telecom questions
Identifying limitations and needs for domain-adapted telecom AI models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluated open-source LLMs for telecom QA
Assessed models using multi-dimensional benchmark metrics
Identified domain adaptation needs for reliable AI
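The LLM-as-a-judge component of the benchmark can be sketched as a rubric prompt plus a score parser. The prompt wording, the 1-5 scale, and the function names below are assumptions for illustration; the paper's actual rubric and judge model calls are not reproduced:

```python
import re
from typing import Optional

# Hypothetical judge rubric; the paper's actual prompt is not shown here.
JUDGE_PROMPT = (
    "You are grading an answer to a telecom engineering question.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Candidate answer: {candidate}\n"
    "Rate the candidate's correctness from 1 to 5 and reply as 'Score: <n>'."
)

def parse_judge_score(reply: str) -> Optional[int]:
    """Extract the 1-5 score from a judge model's reply, or None if absent."""
    match = re.search(r"Score:\s*([1-5])", reply)
    return int(match.group(1)) if match else None
```

Running the same judge prompt several times and inspecting the variance of the parsed scores is one simple way to operationalize the consistency analysis described above.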