PodEval: A Multimodal Evaluation Framework for Podcast Audio Generation

📅 2025-10-01
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Current open-ended long-audio generation (e.g., podcast synthesis) lacks unified, reliable, and multidimensional evaluation standards, facing challenges including the absence of reference ground truth, inconsistent metrics, and high subjectivity. This paper introduces PodEval—the first open-source, multimodal evaluation framework specifically designed for podcast generation. It establishes a real-world podcast dataset and decomposes assessment into three orthogonal dimensions: textual semantics, vocal delivery, and audio fidelity—integrating automated metrics (e.g., WER, PESQ, estimated MOS) with structured human evaluation. Its key innovation is the first systematic realization of content–format disentangled evaluation, enabling fair, apples-to-apples comparison across both open- and closed-source models. Extensive experiments validate PodEval's effectiveness on human-produced podcasts, open-source models (e.g., Fish-Speech), and commercial systems. The framework provides a reproducible, extensible, and standardized benchmark for long-audio generation evaluation.

📝 Abstract
Recently, an increasing number of multimodal (text and audio) benchmarks have emerged, primarily focusing on evaluating models' understanding capability. However, exploration into assessing generative capabilities remains limited, especially for open-ended long-form content generation. Significant challenges include the absence of a reference standard answer, the lack of unified evaluation metrics, and the variability of human judgments. In this work, we take podcast-like audio generation as a starting point and propose PodEval, a comprehensive and well-designed open-source evaluation framework. In this framework: 1) We construct a real-world podcast dataset spanning diverse topics, serving as a reference for human-level creative quality. 2) We introduce a multimodal evaluation strategy that decomposes the complex task into three dimensions: text, speech, and audio, with different evaluation emphases on "Content" and "Format". 3) For each modality, we design corresponding evaluation methods involving both objective metrics and subjective listening tests. We evaluate representative podcast generation systems (including open-source, closed-source, and human-made) in our experiments. The results offer in-depth analysis and insights into podcast generation, demonstrating the effectiveness of PodEval in evaluating open-ended long-form audio. This project is open-source to facilitate public use: https://github.com/yujxx/PodEval.
Problem

Research questions and friction points this paper is trying to address.

Evaluating generative capabilities for open-ended long-form content
Addressing lack of reference standards and unified metrics
Developing multimodal evaluation framework for podcast audio generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Constructed real-world podcast dataset for reference
Introduced multimodal evaluation strategy across three dimensions
Designed objective metrics and subjective listening tests
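Among the objective metrics named above, WER (word error rate) is the standard text-side measure: the minimum number of word substitutions, deletions, and insertions needed to turn the hypothesis transcript into the reference, divided by the reference length. A minimal sketch of that computation is below; the function name and interface are illustrative, not PodEval's actual API.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate = (substitutions + deletions + insertions) / reference length,
    computed as word-level Levenshtein distance via dynamic programming."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i          # delete all remaining reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j          # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # match / substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ≈ 0.167
```

In practice, long-form systems like podcast generators are transcribed with an ASR model first, and text normalization (casing, punctuation, number spelling) is applied to both sides before scoring, since those choices can dominate the reported WER.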
Authors

- Yujia Xiao — The Chinese University of Hong Kong
- Liumeng Xue — The Hong Kong University of Science and Technology
- Lei He — Microsoft, China
- Xinyi Chen — South China University of Technology, China
- Aemon Yat Fei Chiu — The Chinese University of Hong Kong
- Wenjie Tian — Northwestern Polytechnical University
- Shaofei Zhang — Microsoft, China
- Qiuqiang Kong — The Chinese University of Hong Kong
- Xinfa Zhu — Northwestern Polytechnical University
- Wei Xue — The Hong Kong University of Science and Technology
- Tan Lee — The Chinese University of Hong Kong