🤖 AI Summary
This work addresses the lack of systematic evaluation resources for large language models (LLMs) in generating structured, context-aware podcast scripts, particularly in long-context (up to 21K tokens) and multi-speaker instruction scenarios. We introduce PodBench, the first comprehensive benchmark tailored for audio-oriented podcast script generation, comprising 800 complex samples and a multidimensional evaluation framework that integrates quantitative constraints with LLM-based quality assessment. Our experiments reveal a notable divergence between instruction adherence and content substance, and demonstrate that explicit reasoning mechanisms substantially enhance the robustness of open-source models in handling long-context coherence and multi-speaker coordination. PodBench provides a reproducible evaluation platform for audio-centric long-form text generation.
📝 Abstract
Podcast script generation requires LLMs to synthesize structured, context-grounded dialogue from diverse inputs, yet systematic evaluation resources for this task remain limited. To bridge this gap, we introduce PodBench, a benchmark comprising 800 samples with inputs up to 21K tokens and complex multi-speaker instructions. We propose a multifaceted evaluation framework that integrates quantitative constraints with LLM-based quality assessment. Extensive experiments reveal that while proprietary models generally excel, open-source models equipped with explicit reasoning demonstrate superior robustness in handling long contexts and multi-speaker coordination compared to standard baselines. However, our analysis uncovers a persistent divergence where high instruction following does not guarantee high content substance. PodBench offers a reproducible testbed to address these challenges in long-form, audio-centric generation.
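As a rough illustration of the two evaluation modes described above, the sketch below pairs programmatic constraint checks with an LLM-as-judge quality scorer. All names (`check_constraints`, `judge_quality`, the `spec` fields, the judging prompt) are hypothetical and not taken from the paper; PodBench's actual constraints and judging rubric will differ.

```python
# Hypothetical sketch of a two-part evaluation in the spirit of PodBench:
# (1) quantitative constraint checks, (2) LLM-based quality assessment.
# Names such as `spec`, `check_constraints`, and `judge_quality` are illustrative only.

import re

def check_constraints(script: str, spec: dict) -> dict:
    """Quantitative checks: speaker count, turn count, rough length."""
    speakers = set(re.findall(r"^([A-Za-z ]+):", script, flags=re.MULTILINE))
    turns = len(re.findall(r"^[A-Za-z ]+:", script, flags=re.MULTILINE))
    words = len(script.split())
    return {
        "speaker_count_ok": len(speakers) == spec["num_speakers"],
        "turn_count_ok": spec["min_turns"] <= turns <= spec["max_turns"],
        "length_ok": words <= spec["max_words"],
    }

def judge_quality(script: str, source_docs: str, ask_llm) -> str:
    """LLM-as-judge scoring; `ask_llm` is any chat-completion callable."""
    judge_prompt = (
        "Rate the podcast script below on a 1-5 scale for faithfulness to the "
        "source material, coherence, and depth. Reply as JSON: "
        '{"faithfulness": x, "coherence": y, "depth": z}.\n\n'
        f"Source:\n{source_docs}\n\nScript:\n{script}"
    )
    return ask_llm(judge_prompt)  # parse and validate the JSON reply in practice

# Example usage of the constraint checker with a toy two-speaker script:
spec = {"num_speakers": 2, "min_turns": 4, "max_turns": 40, "max_words": 1500}
script = (
    "Host: Welcome back.\nGuest: Thanks for having me.\n"
    "Host: Let's dive in.\nGuest: Sure."
)
print(check_constraints(script, spec))
```

In this framing, the constraint checks give hard pass/fail signals on instruction following, while the judge scores capture content substance, which is exactly the pair of dimensions the abstract reports can diverge.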