AI Summary
Existing open-source tools struggle to efficiently and accurately convert standard e-books into audiobooks with synchronized text highlighting while preserving privacy, layout fidelity, and offline usability. This work proposes the first open-source framework for generating synchronized audiobooks based on the EPUB 3 Media Overlay standard. By integrating open neural text-to-speech systems such as XTTS-v2 and Chatterbox, the framework directly captures word-level timestamps during speech synthesis, eliminating the need for post-hoc forced alignment. The approach fully retains original document formatting and embedded media, operates entirely offline, and significantly enhances synchronization accuracy and reading experience. It effectively avoids reliance on external APIs, mitigates privacy leakage risks, and reduces potential copyright concerns.
Abstract
A narrated e-book combines synchronized audio with digital text, highlighting the currently spoken word or sentence during playback. This format supports early literacy and assists individuals with reading challenges, while also allowing general readers to switch seamlessly between reading and listening. With the emergence of natural-sounding neural Text-to-Speech (TTS) technology, several commercial services have been developed that use this technology to convert standard text e-books into high-quality narrated e-books. However, no open-source solutions currently exist to perform this task. In this paper, we present Calliope, an open-source framework designed to fill this gap. Our method leverages state-of-the-art open-source TTS to convert a text e-book into a narrated e-book in the EPUB 3 Media Overlay format. The method offers several innovative steps: audio timestamps are captured directly during TTS, ensuring exact synchronization between narration and text highlighting; the publisher's original typography, styling, and embedded media are strictly preserved; and the entire pipeline operates offline. This offline capability eliminates recurring API costs, mitigates privacy concerns, and avoids copyright compliance issues associated with cloud-based services. The framework currently supports the open-source TTS systems XTTS-v2 and Chatterbox. A potential alternative approach is to first generate the narration via TTS and then synchronize it with the text using forced alignment. However, while our method ensures exact synchronization, our experiments show that forced alignment introduces drift between the audio and the text highlighting that is significant enough to degrade the reading experience. Source code and usage instructions are available at https://github.com/hugohammer/TTS-Narrated-Ebook-Creator.git.
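To illustrate how timestamps captured during synthesis can be turned into an EPUB 3 Media Overlay, the following is a minimal sketch and not Calliope's actual code. It assumes each text fragment is synthesized separately, that the duration of each resulting audio segment is known (for example, sample count divided by sample rate), and that the segments are concatenated into one audio file per chapter. The function names, fragment IDs, and file names are hypothetical.

```python
import xml.etree.ElementTree as ET

SMIL_NS = "http://www.w3.org/ns/SMIL"


def clips_from_durations(durations):
    """Convert per-segment durations (seconds) into (begin, end) offsets
    within the concatenated audio file."""
    clips, start = [], 0.0
    for d in durations:
        clips.append((start, start + d))
        start += d
    return clips


def build_smil(fragment_ids, durations, text_href, audio_href):
    """Build a Media Overlay (SMIL) document pairing each text fragment
    with its audio clip. All arguments are illustrative assumptions."""
    ET.register_namespace("", SMIL_NS)  # serialize SMIL as the default namespace
    smil = ET.Element(f"{{{SMIL_NS}}}smil", {"version": "3.0"})
    body = ET.SubElement(smil, f"{{{SMIL_NS}}}body")
    clips = clips_from_durations(durations)
    for i, (frag, (begin, end)) in enumerate(zip(fragment_ids, clips), start=1):
        par = ET.SubElement(body, f"{{{SMIL_NS}}}par", {"id": f"par{i}"})
        # The text element points at the fragment to highlight.
        ET.SubElement(par, f"{{{SMIL_NS}}}text", {"src": f"{text_href}#{frag}"})
        # The audio element gives the clip to play while it is highlighted.
        ET.SubElement(par, f"{{{SMIL_NS}}}audio", {
            "src": audio_href,
            "clipBegin": f"{begin:.3f}s",
            "clipEnd": f"{end:.3f}s",
        })
    return ET.tostring(smil, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical fragment IDs and segment durations for one chapter.
    print(build_smil(["s1", "s2"], [1.5, 2.25], "chapter1.xhtml", "audio/chapter1.mp3"))
```

Because the clip boundaries are derived from the synthesized segments themselves, no alignment step is needed afterwards; a forced-alignment pipeline would instead have to estimate these boundaries from the finished audio.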