CovertComBench: The First Domain-Specific Testbed for LLMs in Wireless Covert Communication

📅 2026-01-26
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the lack of suitable benchmarks for evaluating large language models (LLMs) in the domain of wireless covert communication, where stringent security constraints—such as Kullback–Leibler (KL) divergence limits—are critical. To bridge this gap, the authors propose the first LLM-specific evaluation benchmark tailored to this field, encompassing tasks in conceptual understanding, optimization derivation, and code generation. They further introduce a novel automatic scoring mechanism grounded in detection theory, implementing an “LLM-as-Judge” framework. Experimental results reveal that while LLMs achieve strong performance in concept identification (81%) and code generation (83%), their accuracy drops significantly in security-critical mathematical derivations, ranging from 18% to 55%. These findings underscore the models’ limitations in high-order reasoning and affirm their role as assistive tools rather than autonomous solvers in safety-sensitive applications.

📝 Abstract
The integration of Large Language Models (LLMs) into wireless networks presents significant potential for automating system design. However, unlike conventional throughput maximization, Covert Communication (CC) requires optimizing transmission utility under strict detection-theoretic constraints, such as Kullback-Leibler divergence limits. Existing benchmarks primarily focus on general reasoning or standard communication tasks and do not adequately evaluate the ability of LLMs to satisfy these rigorous security constraints. To address this limitation, we introduce CovertComBench, a unified benchmark designed to assess LLM capabilities across the CC pipeline, encompassing conceptual understanding (MCQs), optimization derivation (ODQs), and code generation (CGQs). Furthermore, we analyze the reliability of automated scoring within a detection-theoretic "LLM-as-Judge" framework. Extensive evaluations across state-of-the-art models reveal a significant performance discrepancy. While LLMs achieve high accuracy in conceptual identification (81%) and code implementation (83%), their performance on the higher-order mathematical derivations necessary for security guarantees ranges between 18% and 55%. This limitation indicates that current LLMs are better suited as implementation assistants than as autonomous solvers for security-constrained optimization. These findings suggest that future research should focus on external tool augmentation to build trustworthy wireless AI systems.
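The detection-theoretic constraint at the heart of the benchmark can be illustrated concretely. In the standard covert-communication setup, the warden distinguishes a noise-only hypothesis from a signal-plus-noise hypothesis, and covertness is typically enforced by bounding the KL divergence between the two observation distributions, e.g. D(P1 || P0) ≤ 2ε². The sketch below (an illustrative assumption using zero-mean Gaussian observations, not code from the paper) shows what checking such a constraint looks like:

```python
import math

def kl_gaussian(var1: float, var0: float) -> float:
    """KL divergence D(N(0, var1) || N(0, var0)) between zero-mean Gaussians."""
    r = var1 / var0
    return 0.5 * (r - 1.0 - math.log(r))

def is_covert(signal_power: float, noise_power: float, epsilon: float) -> bool:
    """Check the common covertness constraint D(P1 || P0) <= 2 * epsilon^2.

    P0: warden observes noise only (variance = noise_power).
    P1: warden observes signal + noise (variance = noise_power + signal_power).
    Via Pinsker's inequality, this bounds the warden's detection advantage.
    """
    d = kl_gaussian(noise_power + signal_power, noise_power)
    return d <= 2.0 * epsilon ** 2

# A transmit power far below the noise floor satisfies the constraint;
# a power equal to the noise floor does not.
print(is_covert(signal_power=0.01, noise_power=1.0, epsilon=0.1))  # True
print(is_covert(signal_power=1.0, noise_power=1.0, epsilon=0.1))   # False
```

This is the kind of constraint the benchmark's optimization-derivation questions require models to manipulate symbolically; the sketch only evaluates it numerically.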
Problem

Research questions and friction points this paper is trying to address.

Covert Communication
Large Language Models
Detection-theoretic Constraints
Security-constrained Optimization
Benchmarking
Innovation

Methods, ideas, or system contributions that make the work stand out.

Covert Communication
Large Language Models
Domain-Specific Benchmark
Detection-Theoretic Constraints
LLM-as-Judge
Zhaozhi Liu
School of Computer Science, South-Central Minzu University, Wuhan 430074, China
Jiaxin Chen
School of Computer Science, South-Central Minzu University, Wuhan 430074, China
Yuanai Xie
School of Computer Science, South-Central Minzu University, Wuhan 430074, China
Yuna Jiang
School of Communications and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210042, China
Minrui Xu
Nanyang Technological University
LLMs for Networks, Quantum Internet, Metaverse, Network Economics, DRL
Xiao Zhang
South-Central Minzu University
UAV Networks, Algorithm Design and Analysis, Computational Intelligence
Pan Lai
School of Computer Science, South-Central Minzu University, Wuhan 430074, China
Zan Zhou
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China