🤖 AI Summary
This study addresses the limited evaluation of open-source large language models (LLMs) on telecom-domain question answering. We introduce a benchmark of 105 technical question-answer pairs drawn from advanced wireless communications material, covering both factual and reasoning-based queries. To systematically assess correctness, consistency, and hallucination risk, we propose a multidimensional evaluation framework that integrates lexical metrics, semantic similarity, LLM-as-a-judge scoring, and source attribution analysis. An experimental comparison of Gemma 3 27B and DeepSeek R1 32B shows that Gemma 3 achieves higher semantic fidelity and LLM-rated correctness, whereas DeepSeek R1 exhibits slightly higher lexical consistency. These findings underscore the role of domain adaptation in building reliable engineering AI assistants. The benchmark and evaluation methodology provide a reproducible, empirically grounded foundation for assessing LLMs in specialized vertical domains, supporting both practical deployment and methodological rigor.
📝 Abstract
Large Language Models (LLMs) have shown remarkable capabilities across various fields. However, their performance in technical domains such as telecommunications remains underexplored. This paper evaluates two open-source LLMs, Gemma 3 27B and DeepSeek R1 32B, on factual and reasoning-based questions derived from advanced wireless communications material. We construct a benchmark of 105 question-answer pairs and assess performance using lexical metrics, semantic similarity, and LLM-as-a-judge scoring. We also analyze consistency, judgment reliability, and hallucination through source attribution and score variance. Results show that Gemma excels in semantic fidelity and LLM-rated correctness, while DeepSeek demonstrates slightly higher lexical consistency. Additional findings highlight current limitations in telecom applications and the need for domain-adapted models to support trustworthy Artificial Intelligence (AI) assistants in engineering.
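The lexical and semantic comparisons described above can be sketched in miniature. The snippet below is a minimal illustration, not the paper's actual pipeline: `token_f1` stands in for the lexical metrics, and a bag-of-words cosine stands in for embedding-based semantic similarity (a real evaluation would use dense sentence embeddings and an LLM judge). All function names and the example QA pair are hypothetical.

```python
from collections import Counter
import math

def token_f1(prediction: str, reference: str) -> float:
    # Lexical overlap: harmonic mean of token precision and recall,
    # a simple stand-in for lexical metrics such as ROUGE.
    pred, ref = prediction.lower().split(), reference.lower().split()
    common = sum((Counter(pred) & Counter(ref)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(ref)
    return 2 * precision * recall / (precision + recall)

def cosine_similarity(prediction: str, reference: str) -> float:
    # Bag-of-words cosine as a toy proxy for semantic similarity;
    # a real setup would compare dense sentence embeddings instead.
    a = Counter(prediction.lower().split())
    b = Counter(reference.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) \
         * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical reference answer and model answer for one benchmark item.
reference = "OFDM divides the channel into many orthogonal subcarriers."
answer = "OFDM splits the channel into orthogonal subcarriers."
print(round(token_f1(answer, reference), 3))
print(round(cosine_similarity(answer, reference), 3))
```

Scoring each of the 105 answers this way, and aggregating per model, yields the kind of lexical-versus-semantic comparison the abstract reports.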