🤖 AI Summary
This work addresses the lack of specialized evaluation benchmarks for assessing large language models' (LLMs) mathematical reasoning capabilities in telecommunications, spanning signal processing, network optimization, and performance analysis. To this end, we introduce TeleMath, the first domain-specific benchmark comprising 500 numerically answerable problems covering core telecommunication mathematics. We propose a scalable, structured question-and-answer (QnA) generation pipeline driven by domain experts and establish a zero-shot and few-shot multi-model evaluation framework. Our key contributions are: (1) the release of TeleMath, the first dedicated benchmark for telecom mathematics; (2) the open-sourcing of the TeleMath dataset and evaluation code; and (3) empirical findings demonstrating substantial limitations of general-purpose LLMs on these tasks, while models specialized for mathematical or logical reasoning achieve up to a 32.7% absolute accuracy gain.
📝 Abstract
The increasing adoption of artificial intelligence in telecommunications has raised interest in the capability of Large Language Models (LLMs) to address domain-specific, mathematically intensive tasks. Although recent advancements have improved the performance of LLMs in general mathematical reasoning, their effectiveness within specialized domains, such as signal processing, network optimization, and performance analysis, remains largely unexplored. To address this gap, we introduce TeleMath, the first benchmark dataset specifically designed to evaluate LLM performance in solving mathematical problems with numerical solutions in the telecommunications domain. Comprising 500 question-answer (QnA) pairs, TeleMath covers a wide spectrum of topics in the telecommunications field. This paper outlines the proposed QnA generation pipeline, starting from a seed set of problems crafted by Subject Matter Experts. The evaluation of a wide range of open-source LLMs reveals that the best performance on TeleMath is achieved by recent models explicitly designed for mathematical or logical reasoning. In contrast, general-purpose models, even those with a large number of parameters, often struggle with these challenges. We have released the dataset and the evaluation code to ease reproducibility of results and support future research.
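Since every TeleMath problem has a numerical answer, evaluation reduces to comparing a model's predicted value against the reference value. The sketch below is a minimal, hypothetical scoring loop assuming a relative-tolerance matching criterion; the function names (`is_correct`, `accuracy`) and the tolerance value are illustrative assumptions, not the paper's exact rule.

```python
import math

def is_correct(predicted: float, reference: float, rel_tol: float = 1e-2) -> bool:
    """Judge a model's numerical answer against the reference value.

    A relative tolerance is used so that small rounding differences
    (e.g. 1.41 vs 1.414) are not penalized. The 1% tolerance here is
    an illustrative choice, not the benchmark's official criterion.
    """
    return math.isclose(predicted, reference, rel_tol=rel_tol)

def accuracy(predictions: list[float], references: list[float],
             rel_tol: float = 1e-2) -> float:
    """Fraction of problems whose predicted value matches the reference."""
    hits = sum(is_correct(p, r, rel_tol) for p, r in zip(predictions, references))
    return hits / len(references)

# Hypothetical run: 3 of 4 answers fall within 1% of the reference values.
preds = [0.250, 12.60, 1.41, 99.0]
refs = [0.25, 12.5, 1.414, 87.0]
print(accuracy(preds, refs))  # → 0.75
```

Relative (rather than absolute) tolerance keeps the criterion meaningful across answers of very different magnitudes, e.g. a bit-error rate of 1e-5 versus a throughput of several Mbps.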