TeleEmbedBench: A Multi-Corpus Embedding Benchmark for RAG in Telecommunications

📅 2026-04-20

🤖 AI Summary

General-purpose embedding benchmarks do a poor job of assessing retrieval-augmented generation (RAG) systems on telecommunications text, which is dense, full of abbreviations, and heavily cross-referenced. To address this gap, this work introduces TeleEmbedBench, the first multi-corpus embedding benchmark tailored to the telecommunications domain. It covers three heterogeneous corpora (O-RAN specifications, 3GPP documents, and the srsRAN codebase) with 9,000 question-chunk pairs. The authors build the dataset with a dual large language model (LLM) pipeline in which one model generates queries and a second validates them. Key contributions include the release of TeleEmbedBench-Clean for evaluating robustness to noisy, incomplete user queries; empirical evidence that domain-specific instructions help retrieval on source code but hurt it on natural language specifications; and results showing that LLM-based embedding models (e.g., Qwen3, EmbeddingGemma) consistently and significantly outperform traditional sentence-transformers in retrieval accuracy and resistance to cross-domain interference.
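The generate-then-validate idea can be illustrated with a minimal sketch. It is not the authors' pipeline: `build_pairs`, `toy_generate`, and `toy_validate` are hypothetical names, the two "models" are offline stand-ins for real LLM calls, and the single validation rule is a placeholder for whatever criteria the benchmark actually applies.

```python
from typing import Callable, Iterable, List, Tuple


def build_pairs(
    chunks: Iterable[str],
    generate: Callable[[str], str],
    validate: Callable[[str, str], bool],
) -> List[Tuple[str, str]]:
    """Keep only the (question, chunk) pairs that the validator accepts."""
    pairs = []
    for chunk in chunks:
        question = generate(chunk)      # first model proposes a query
        if validate(question, chunk):   # second model accepts or rejects it
            pairs.append((question, chunk))
    return pairs


# Hypothetical stand-ins for the two LLM calls, so the sketch runs offline.
def toy_generate(chunk: str) -> str:
    return "What does this passage say about " + chunk.split()[0] + "?"


def toy_validate(question: str, chunk: str) -> bool:
    # Placeholder criterion (an assumption): reject very short chunks.
    return len(chunk.split()) >= 4


# Illustrative chunks only, not taken from the benchmark.
chunks = [
    "E2 interface connects the near-RT RIC to E2 nodes",
    "Too short",
]
pairs = build_pairs(chunks, toy_generate, toy_validate)
```

Passing the generator and validator as callables keeps the filtering loop independent of any particular model or API, so either stage can be swapped out without touching the loop.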

📝 Abstract

Large language models (LLMs) are increasingly deployed in the telecommunications domain for critical tasks, relying heavily on Retrieval-Augmented Generation (RAG) to adapt general-purpose models to continuously evolving standards. However, a significant gap exists in evaluating the embedding models that power these RAG pipelines, as general-purpose benchmarks fail to capture the dense, acronym-heavy, and highly cross-referential nature of telecommunications corpora. To address this, we introduce TeleEmbedBench, the first large-scale, multi-corpus embedding benchmark designed specifically for telecommunications. The benchmark spans three heterogeneous corpora: O-RAN Alliance specifications, 3GPP release documents, and the srsRAN open-source codebase, comprising 9,000 question-chunk pairs across three standard chunk sizes (512, 1024, and 2048 tokens). To construct this dataset at scale without manual annotation bottlenecks, we employ a novel automated pipeline where one LLM generates specific queries from text chunks and a secondary LLM validates them against strict criteria. We comprehensively evaluate eight embedding models, spanning standard sentence-transformers and LLM-based embedders. Our results demonstrate that LLM-based embedders, such as Qwen3 and EmbeddingGemma, consistently and significantly outperform traditional sentence-transformers in both retrieval accuracy and robustness against cross-domain interference. Additionally, we introduce TeleEmbedBench-Clean to evaluate model robustness against noisy, incomplete user queries. Finally, our analysis reveals that while domain-specific task instructions improve embedder performance for raw source code, they paradoxically degrade retrieval performance for natural language telecommunications specifications.
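To make "retrieval accuracy" over question-chunk pairs concrete, here is a minimal sketch that scores an embedder with recall@k under cosine similarity. The metric choice is an assumption, since the abstract does not name one. `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, and the corpus and questions are invented for illustration.

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Dict[str, float]


def toy_embed(text: str) -> Vector:
    """Hypothetical stand-in embedder: lowercase bag-of-words counts."""
    return {tok: float(n) for tok, n in Counter(text.lower().split()).items()}


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recall_at_k(
    pairs: Sequence[Tuple[str, int]],
    corpus: List[str],
    embed: Callable[[str], Vector],
    k: int,
) -> float:
    """Fraction of questions whose gold chunk index is among the top-k hits."""
    chunk_vecs = [embed(chunk) for chunk in corpus]
    hits = 0
    for question, gold_idx in pairs:
        q_vec = embed(question)
        ranked = sorted(
            range(len(corpus)),
            key=lambda i: cosine(q_vec, chunk_vecs[i]),
            reverse=True,
        )
        if gold_idx in ranked[:k]:
            hits += 1
    return hits / len(pairs)


# Invented toy data: each question is paired with the index of its source chunk.
corpus = [
    "the rrc layer handles connection setup",
    "the scheduler allocates resource blocks",
    "logging is configured in the build script",
]
pairs = [
    ("which layer handles connection setup", 0),
    ("what allocates resource blocks", 1),
]
score = recall_at_k(pairs, corpus, toy_embed, k=1)
```

Swapping `toy_embed` for a function that wraps a real model lets the same loop compare embedders on identical pairs. Mixing chunks from several corpora into one `corpus` list is one way to probe cross-domain interference.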

Problem

Research questions and friction points this paper is trying to address.

embedding benchmark

telecommunications

Retrieval-Augmented Generation

domain-specific evaluation

cross-referential corpora

Innovation

Methods, ideas, or system contributions that make the work stand out.

TeleEmbedBench

Retrieval-Augmented Generation

embedding benchmark