🤖 AI Summary
Existing text-to-audio (TTA) models struggle to capture temporal, causal, and co-occurrence relationships among sound events described in text, resulting in generated audio lacking relational fidelity. To address this, we propose the first systematic solution for audio event relationship modeling: (1) we construct the first real-world benchmark dataset with fine-grained relational annotations—comprising both a curated event corpus and a corresponding audio collection—and establish a multi-dimensional evaluation framework; (2) we design a lightweight, plug-and-play relational-aware fine-tuning framework that integrates structured relation supervision, multi-granularity event modeling, and a cross-modal relational alignment loss. Extensive experiments on mainstream TTA models demonstrate significant improvements in relational fidelity, temporal consistency, and semantic coherence, with average relative gains of 12.6% across multiple subtasks. Our code and dataset are publicly released.
📝 Abstract
Despite significant advancements in Text-to-Audio (TTA) generation models achieving high-fidelity audio with fine-grained context understanding, they struggle to model the relations between audio events described in the input text. However, previous TTA methods have not systematically explored audio event relation modeling, nor have they proposed frameworks to enhance this capability. In this work, we systematically study audio event relation modeling in TTA generation models. We first establish a benchmark for this task by: 1. proposing a comprehensive relation corpus covering all potential relations in real-world scenarios; 2. introducing a new audio event corpus encompassing commonly heard audios; and 3. proposing new evaluation metrics to assess audio event relation modeling from various perspectives. Furthermore, we propose a finetuning framework to enhance existing TTA models ability to model audio events relation. Code is available at: https://github.com/yuhanghe01/RiTTA