🤖 AI Summary
This work addresses the lack of systematic evaluation resources for large language models (LLMs) in generating structured, context-aware podcast scripts, particularly in long-context (up to 21K tokens) and multi-speaker instruction scenarios. We introduce PodBench, the first comprehensive benchmark tailored for audio-oriented podcast script generation, comprising 800 complex samples and a multidimensional evaluation framework that integrates quantitative constraints with LLM-based quality assessment. Our experiments reveal a notable divergence between instruction adherence and content substance, and demonstrate that explicit reasoning mechanisms substantially enhance the robustness of open-source models in handling long-context coherence and multi-speaker coordination. PodBench provides a reproducible evaluation platform for audio-centric long-form text generation.
📝 Abstract
Podcast script generation requires LLMs to synthesize structured, context-grounded dialogue from diverse inputs, yet systematic evaluation resources for this task remain limited. To bridge this gap, we introduce PodBench, a benchmark comprising 800 samples with inputs up to 21K tokens and complex multi-speaker instructions. We propose a multifaceted evaluation framework that integrates quantitative constraints with LLM-based quality assessment. Extensive experiments reveal that while proprietary models generally excel, open-source models equipped with explicit reasoning demonstrate superior robustness in handling long contexts and multi-speaker coordination compared to standard baselines. However, our analysis uncovers a persistent divergence where high instruction following does not guarantee high content substance. PodBench offers a reproducible testbed to address these challenges in long-form, audio-centric generation.
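As a rough illustration of the two evaluation modes described above, the sketch below pairs programmatic constraint checks with an LLM-as-judge quality scorer. All names (`check_constraints`, `judge_quality`, the `spec` fields, the judging prompt) are hypothetical and not taken from the paper; PodBench's actual constraints and judging rubric will differ.

```python
# Hypothetical sketch of a two-part evaluation in the spirit of PodBench:
# (1) quantitative constraint checks, (2) LLM-based quality assessment.
# Names such as `spec`, `check_constraints`, and `judge_quality` are illustrative only.

import re

def check_constraints(script: str, spec: dict) -> dict:
    """Quantitative checks: speaker count, turn count, rough length."""
    speakers = set(re.findall(r"^([A-Za-z ]+):", script, flags=re.MULTILINE))
    turns = len(re.findall(r"^[A-Za-z ]+:", script, flags=re.MULTILINE))
    words = len(script.split())
    return {
        "speaker_count_ok": len(speakers) == spec["num_speakers"],
        "turn_count_ok": spec["min_turns"] <= turns <= spec["max_turns"],
        "length_ok": words <= spec["max_words"],
    }

def judge_quality(script: str, source_docs: str, ask_llm) -> str:
    """LLM-as-judge scoring; `ask_llm` is any chat-completion callable."""
    judge_prompt = (
        "Rate the podcast script below on a 1-5 scale for faithfulness to the "
        "source material, coherence, and depth. Reply as JSON: "
        '{"faithfulness": x, "coherence": y, "depth": z}.\n\n'
        f"Source:\n{source_docs}\n\nScript:\n{script}"
    )
    return ask_llm(judge_prompt)  # parse and validate the JSON reply in practice

# Example usage of the constraint checker with a toy two-speaker script:
spec = {"num_speakers": 2, "min_turns": 4, "max_turns": 40, "max_words": 1500}
script = (
    "Host: Welcome back.\nGuest: Thanks for having me.\n"
    "Host: Let's dive in.\nGuest: Sure."
)
print(check_constraints(script, spec))
```

In this framing, the constraint checks give hard pass/fail signals on instruction following, while the judge scores capture content substance, which is exactly the pair of dimensions the abstract reports can diverge.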