6G-Bench: An Open Benchmark for Semantic Communication and Network-Level Reasoning with Foundation Models in AI-Native 6G Networks

📅 2026-02-09
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
This work addresses the lack of standardized evaluation benchmarks for semantic communication and network-level reasoning in AI-native 6G systems. We propose 6G-Bench, the first framework to systematically formalize 6G decision-making tasks from standards bodies such as 3GPP and IETF into an evaluable reasoning benchmark. It spans five capability categories and 30 tasks, from which 10,000 multiple-choice questions are generated via task-conditioned prompting; a high-confidence subset of 3,722 items is expert-validated. The benchmark supports long-context inputs (up to 1M tokens), demands multi-step quantitative reasoning, and frames decisions as worst-case regret minimization over multi-turn horizons, with the full question pool released for both training and evaluation. Evaluation across 22 foundation models reveals pass@1 accuracy ranging from 0.22 to 0.82; top models reach 0.87–0.89 on intent and policy reasoning, and pass@5 on reasoning-intensive tasks spans 0.20 to 0.91. The dataset is publicly released to foster open research.
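
The reported pass@1 and pass@5 scores come without scoring code on this page. Assuming they follow the standard unbiased pass@k estimator of Chen et al. (2021), a minimal Python sketch is below; `pass_at_k`, `mean_pass_at_k`, and the toy `(n, c)` result pairs are illustrative names, not the benchmark's actual API. (With deterministic decoding, pass@1 reduces to plain single-shot accuracy, matching the abstract's phrasing.)

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: completions sampled per question
    c: completions that selected the correct option
    k: budget of attempts scored
    """
    if n - c < k:
        return 1.0  # every size-k draw contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(per_question: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over (n, c) pairs, one pair per benchmark question."""
    return sum(pass_at_k(n, c, k) for n, c in per_question) / len(per_question)

# Toy example: 3 questions, 5 samples each, varying correct counts.
results = [(5, 5), (5, 2), (5, 0)]
print(mean_pass_at_k(results, 1))  # (1.0 + 0.4 + 0.0) / 3 ≈ 0.467
print(mean_pass_at_k(results, 5))  # (1.0 + 1.0 + 0.0) / 3 ≈ 0.667
```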

📝 Abstract
This paper introduces 6G-Bench, an open benchmark for evaluating semantic communication and network-level reasoning in AI-native 6G networks. 6G-Bench defines a taxonomy of 30 decision-making tasks (T1–T30) extracted from ongoing 6G and AI-agent standardization activities in 3GPP, IETF, ETSI, ITU-T, and the O-RAN Alliance, and organizes them into five standardization-aligned capability categories. Starting from 113,475 scenarios, we generate a balanced pool of 10,000 very-hard multiple-choice questions using task-conditioned prompts that enforce multi-step quantitative reasoning under uncertainty and worst-case regret minimization over multi-turn horizons. After automated filtering and expert human validation, 3,722 questions are retained as a high-confidence evaluation set, while the full pool is released to support training and fine-tuning of 6G-specialized models. Using 6G-Bench, we evaluate 22 foundation models spanning dense and mixture-of-experts architectures, short- and long-context designs (up to 1M tokens), and both open-weight and proprietary systems. Across models, deterministic single-shot accuracy (pass@1) spans a wide range from 0.22 to 0.82, highlighting substantial variation in semantic reasoning capability. Leading models achieve intent and policy reasoning accuracy in the range 0.87–0.89, while selective robustness analysis on reasoning-intensive tasks shows pass@5 values ranging from 0.20 to 0.91. To support open science and reproducibility, we release the 6G-Bench dataset on GitHub: https://github.com/maferrag/6G-Bench
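
The abstract's "worst-case regret minimization over multi-turn horizons" is not formalized on this page. As a rough sketch, assuming the standard minimax-regret formulation over a T-turn episode with scenario set S, action set A, and per-turn reward r_t (the symbols are my notation, not necessarily the paper's):

```latex
% Worst-case (minimax) regret over a T-turn horizon -- standard form,
% assumed here; the paper's exact objective may differ.
\mathrm{Regret}(\pi)
  = \max_{s \in \mathcal{S}} \sum_{t=1}^{T}
    \Bigl[\, \max_{a \in \mathcal{A}} r_t(s, a) - r_t\bigl(s, \pi_t(s)\bigr) \Bigr],
\qquad
\pi^{\star} = \arg\min_{\pi} \mathrm{Regret}(\pi).
```

Under this reading, a policy is penalized by its gap to the best action in hindsight at each turn, and the benchmark rewards answers that hold up under the least favorable scenario rather than the average one.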
Problem

Research questions and friction points this paper is trying to address.

semantic communication · network-level reasoning · AI-native 6G · benchmark · foundation models
Innovation

Methods, ideas, or system contributions that make the work stand out.

semantic communication · network-level reasoning · foundation models · 6G benchmark · multi-step reasoning