AI Summary
To address the low efficiency and high cost of manually producing Chinese audiobook commentary content, this paper proposes the first multi-agent collaborative generation system tailored for podcast-style audiobook interpretation. The method introduces a novel framework comprising 11 specialized agents, covering the end-to-end pipeline, from thematic mining and illustrative case extraction to logical structuring and colloquial script synthesis. It tightly integrates large language models (LLMs) with text-to-speech (TTS) technologies, incorporating modules for thematic analysis, case-based reasoning, editorial refinement, iterative factual verification, and speech synthesis. Experimental results demonstrate that the generated commentary scripts significantly outperform human-expert versions in conciseness and factual accuracy, though speech naturalness remains an area for improvement. This work establishes a new paradigm for high-quality, scalable automation of spoken-content production.
Abstract
Audiobook interpretations are attracting increasing attention, as they provide accessible and in-depth analyses of books that offer readers practical insights and intellectual inspiration. However, their manual creation process remains time-consuming and resource-intensive. To address this challenge, we propose AI4Reading, a multi-agent collaboration system leveraging large language models (LLMs) and speech synthesis technology to generate podcast-like audiobook interpretations. The system is designed to meet three key objectives: accurate content preservation, enhanced comprehensibility, and a logical narrative structure. To achieve these goals, we develop a framework composed of 11 specialized agents, including topic analysts, case analysts, editors, a narrator, and proofreaders, that work in concert to explore themes, extract real-world cases, refine content organization, and synthesize natural spoken language. A comparison of expert interpretations with our system's output shows that, although AI4Reading still has a gap in speech generation quality, the generated interpretative scripts are simpler and more accurate.