PodBench: A Comprehensive Benchmark for Instruction-Aware Audio-Oriented Podcast Script Generation

📅 2026-01-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of systematic evaluation resources for large language models (LLMs) in generating structured, context-aware podcast scripts, particularly under long-context (up to 21K tokens) and multi-speaker instruction scenarios. We introduce PodBench, the first comprehensive benchmark tailored for audio-oriented podcast script generation, comprising 800 complex samples and a multidimensional evaluation framework that integrates quantitative constraints with LLM-based quality assessment. Our experiments reveal a notable inconsistency between instruction adherence and content substance, and demonstrate that explicit reasoning mechanisms substantially enhance the robustness of open-source models in handling long-context coherence and multi-speaker coordination. PodBench provides a reproducible evaluation platform for audio-centric long-form text generation.

📝 Abstract
Podcast script generation requires LLMs to synthesize structured, context-grounded dialogue from diverse inputs, yet systematic evaluation resources for this task remain limited. To bridge this gap, we introduce PodBench, a benchmark comprising 800 samples with inputs up to 21K tokens and complex multi-speaker instructions. We propose a multifaceted evaluation framework that integrates quantitative constraints with LLM-based quality assessment. Extensive experiments reveal that while proprietary models generally excel, open-source models equipped with explicit reasoning demonstrate superior robustness in handling long contexts and multi-speaker coordination compared to standard baselines. However, our analysis uncovers a persistent divergence where high instruction following does not guarantee high content substance. PodBench offers a reproducible testbed to address these challenges in long-form, audio-centric generation.
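The "quantitative constraints" half of such an evaluation framework can be illustrated with a minimal script checker. The turn format (`Speaker: utterance` per line), the function name, and the thresholds below are illustrative assumptions, not PodBench's actual protocol.

```python
import re

def check_constraints(script: str, required_speakers: set, max_words: int) -> dict:
    """Check a generated podcast script against simple quantitative constraints.

    Assumes each dialogue turn is a line of the form "Speaker: utterance".
    """
    # Collect speaker labels appearing at the start of a line before a colon.
    speakers = {m.group(1).strip()
                for m in re.finditer(r"^([^:\n]{1,40}):", script, re.M)}
    word_count = len(script.split())
    return {
        "all_speakers_present": required_speakers <= speakers,
        "within_length": word_count <= max_words,
        "word_count": word_count,
    }

script = "Host: Welcome to the show.\nGuest: Thanks for having me."
print(check_constraints(script, {"Host", "Guest"}, 500))
```

A full pipeline in the paper's spirit would pair checks like these with an LLM-based judge for content quality, since the benchmark's central finding is that the two can diverge.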
Problem

Research questions and friction points this paper is trying to address.

podcast script generation
instruction following
long-context generation
multi-speaker dialogue
audio-oriented generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

podcast script generation
instruction-aware evaluation
long-context modeling
multi-speaker dialogue
LLM benchmarking
👥 Authors
Chenning Xu
Large Language Model Department, Tencent
Mao Zheng
Large Language Model Department, Tencent
Mingyu Zheng
Institute of Information Engineering, CAS
NLP · Table Understanding · LLMs
Mingyang Song
Tencent Inc.
NLP · IR · LLMs