🤖 AI Summary
Existing general-purpose embedding benchmarks fail to capture how embedding models behave inside retrieval-augmented generation (RAG) pipelines over telecommunications text, which is dense, acronym-heavy, and heavily cross-referenced. To address this gap, this work introduces TeleEmbedBench, the first large-scale, multi-corpus embedding benchmark tailored to the telecommunications domain. It spans three heterogeneous corpora (O-RAN Alliance specifications, 3GPP release documents, and the srsRAN open-source codebase) and comprises 9,000 question-chunk pairs across chunk sizes of 512, 1024, and 2048 tokens. The authors build the dataset with an automated two-LLM pipeline in which one model generates queries from text chunks and a second model validates them. Key contributions include the introduction of TeleEmbedBench-Clean for evaluating robustness against noisy, incomplete user queries; the finding that domain-specific task instructions improve retrieval over raw source code but degrade it over natural-language specifications; and evidence that LLM-based embedding models (e.g., Qwen3, EmbeddingGemma) consistently and significantly outperform traditional sentence-transformers in retrieval accuracy and robustness against cross-domain interference.
📝 Abstract
Large language models (LLMs) are increasingly deployed in the telecommunications domain for critical tasks, relying heavily on Retrieval-Augmented Generation (RAG) to adapt general-purpose models to continuously evolving standards. However, a significant gap exists in evaluating the embedding models that power these RAG pipelines, as general-purpose benchmarks fail to capture the dense, acronym-heavy, and highly cross-referential nature of telecommunications corpora. To address this, we introduce TeleEmbedBench, the first large-scale, multi-corpus embedding benchmark designed specifically for telecommunications. The benchmark spans three heterogeneous corpora: O-RAN Alliance specifications, 3GPP release documents, and the srsRAN open-source codebase. It comprises 9,000 question-chunk pairs across three standard chunk sizes (512, 1024, and 2048 tokens). To construct this dataset at scale without manual annotation bottlenecks, we employ a novel automated pipeline in which one LLM generates specific queries from text chunks and a second LLM validates them against strict criteria. We comprehensively evaluate eight embedding models, spanning standard sentence-transformers and LLM-based embedders. Our results demonstrate that LLM-based embedders, such as Qwen3 and EmbeddingGemma, consistently and significantly outperform traditional sentence-transformers in both retrieval accuracy and robustness against cross-domain interference. Additionally, we introduce TeleEmbedBench-Clean to evaluate model robustness against noisy, incomplete user queries. Finally, our analysis reveals that while domain-specific task instructions improve embedder performance on raw source code, they paradoxically degrade retrieval performance on natural language telecommunications specifications.
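To make the evaluation setup concrete, the following is a minimal sketch of scoring an embedder on question-chunk pairs, with an optional task instruction prepended to each query. It rests on assumptions that the abstract does not confirm: Recall@k stands in for "retrieval accuracy" (the exact metric is not named), `toy_embed` is a hypothetical keyword-count stand-in for a real embedding model, and the sample questions, chunks, and instruction string are invented for illustration.

```python
import math
import re
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall_at_k(
    embed: Callable[[str], Vector],
    questions: Sequence[str],
    chunks: Sequence[str],
    gold: Sequence[int],
    k: int = 1,
    instruction: str = "",
) -> float:
    """Fraction of questions whose gold chunk index appears in the top-k results.

    `instruction` is prepended to each query only (hypothetical convention),
    which lets the same function compare runs with and without a task prompt.
    """
    chunk_vecs = [embed(c) for c in chunks]
    hits = 0
    for question, gold_idx in zip(questions, gold):
        q_vec = embed(instruction + question)
        ranked = sorted(
            range(len(chunks)),
            key=lambda i: cosine(q_vec, chunk_vecs[i]),
            reverse=True,
        )
        hits += gold_idx in ranked[:k]
    return hits / len(questions)


def toy_embed(text: str) -> list[float]:
    """Hypothetical stand-in embedder: counts of a few telecom keywords."""
    vocab = ["handover", "e2", "scheduler", "pdcp"]
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [float(words.count(w)) for w in vocab]


# Invented sample data, not taken from the benchmark.
chunks = [
    "E2 interface setup procedure",
    "Handover preparation steps",
    "Scheduler class in srsRAN",
]
questions = ["How is the E2 interface set up?", "Which steps prepare a handover?"]
gold = [0, 1]  # index of the chunk each question was generated from

plain = recall_at_k(toy_embed, questions, chunks, gold, k=1)
prompted = recall_at_k(
    toy_embed, questions, chunks, gold, k=1,
    instruction="Retrieve the telecom passage: ",
)
print(plain, prompted)
```

With a real model, `toy_embed` would be replaced by the model's encoding call, and comparing the `plain` and `prompted` scores per corpus is one way to observe the instruction effect described above.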