DegDiT: Controllable Audio Generation with Dynamic Event Graph Guided Diffusion Transformer

📅 2025-08-19
📈 Citations: 0 · Influential citations: 0
🤖 AI Summary
Controllable text-to-audio generation faces an inherent trade-off among temporal localization accuracy, open-vocabulary scalability, and inference efficiency. To address this, we propose DegDiT, which models the events in a description as a Dynamic Event Graph (DEG) that unifies event semantics, temporal structure, and cross-event relationships in a learnable graph representation, and integrates it with a graph-transformer-guided diffusion transformer for fine-grained temporal control. The approach introduces three key innovations: (i) DEG-guided generation, (ii) quality-balanced data selection with hierarchical event annotation, and (iii) consensus preference optimization, which jointly enhance temporal consistency without compromising audio diversity. Evaluated on the AudioCondition, DESED, and AudioTime benchmarks, DegDiT achieves state-of-the-art performance on both objective metrics (e.g., FAD, STOI) and subjective MOS scores, significantly outperforming prior work.
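The paper does not ship reference code, so the following is only a rough, hypothetical illustration of the dynamic event graph idea: each event becomes a node carrying a label, onset/offset times, and a text-encoder embedding, with typed edges for temporal relations between events. The class names, fields, and relation labels below are invented for illustration, not the authors' implementation.

```python
# Hypothetical sketch of a dynamic event graph (DEG): nodes carry event
# semantics and timing, edges carry cross-event temporal relations.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class EventNode:
    label: str            # open-vocabulary event description, e.g. "dog barking"
    onset: float          # start time in seconds
    offset: float         # end time in seconds
    embedding: List[float] = field(default_factory=list)  # text-encoder feature

@dataclass
class DynamicEventGraph:
    nodes: List[EventNode] = field(default_factory=list)
    # (src_index, dst_index, relation), e.g. "precedes" or "overlaps"
    edges: List[Tuple[int, int, str]] = field(default_factory=list)

    def add_event(self, node: EventNode) -> int:
        self.nodes.append(node)
        return len(self.nodes) - 1

    def connect(self, src: int, dst: int) -> None:
        """Derive a temporal relation from onset/offset ordering."""
        a, b = self.nodes[src], self.nodes[dst]
        relation = "precedes" if a.offset <= b.onset else "overlaps"
        self.edges.append((src, dst, relation))

# Example: "a dog barks, then a door slams while rain continues"
graph = DynamicEventGraph()
dog = graph.add_event(EventNode("dog barking", onset=0.0, offset=2.0))
door = graph.add_event(EventNode("door slam", onset=2.5, offset=3.0))
rain = graph.add_event(EventNode("rain", onset=0.0, offset=10.0))
graph.connect(dog, door)   # precedes
graph.connect(rain, door)  # overlaps
```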

📝 Abstract
Controllable text-to-audio generation aims to synthesize audio from textual descriptions while satisfying user-specified constraints, including event types, temporal sequences, and onset and offset timestamps. This enables precise control over both the content and temporal structure of the generated audio. Despite recent progress, existing methods still face inherent trade-offs among accurate temporal localization, open-vocabulary scalability, and practical efficiency. To address these challenges, we propose DegDiT, a novel dynamic event graph-guided diffusion transformer framework for open-vocabulary controllable audio generation. DegDiT encodes the events in the description as structured dynamic graphs. The nodes in each graph are designed to represent three aspects: semantic features, temporal attributes, and inter-event connections. A graph transformer is employed to integrate these nodes and produce contextualized event embeddings that serve as guidance for the diffusion model. To ensure high-quality and diverse training data, we introduce a quality-balanced data selection pipeline that combines hierarchical event annotation with multi-criteria quality scoring, resulting in a curated dataset with semantic diversity. Furthermore, we present consensus preference optimization, which steers generation toward consensus among multiple reward signals. Extensive experiments on the AudioCondition, DESED, and AudioTime datasets demonstrate that DegDiT achieves state-of-the-art performance across a variety of objective and subjective evaluation metrics.
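To make the guidance path in the abstract concrete, here is a minimal sketch with loud caveats: a stock nn.TransformerEncoder stands in for the paper's graph transformer, cross-attention is an assumed conditioning mechanism, and all dimensions are made up. It only shows the data flow from event node features to contextualized embeddings to guided diffusion latents.

```python
# Minimal sketch, assuming a standard Transformer encoder approximates the
# graph transformer and cross-attention injects event guidance into the
# diffusion backbone. Modules and sizes are illustrative.
import torch
import torch.nn as nn

d_model = 256

# Per-node features: semantic embedding + (onset, offset) timing channels.
num_events = 3
semantic = torch.randn(num_events, d_model - 2)          # from a text encoder
timing = torch.tensor([[0.0, 2.0], [2.5, 3.0], [0.0, 10.0]])
node_feats = torch.cat([semantic, timing], dim=-1).unsqueeze(0)  # (1, N, d)

# "Graph transformer": self-attention lets each event attend to the others,
# producing contextualized event embeddings.
encoder_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
graph_encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
event_embeddings = graph_encoder(node_feats)             # (1, N, d)

# Diffusion-transformer side: noisy audio latents cross-attend to the event
# embeddings, so denoising is steered by event semantics and timing.
latents = torch.randn(1, 128, d_model)                   # (batch, frames, d)
cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
guided, _ = cross_attn(query=latents, key=event_embeddings,
                       value=event_embeddings)
print(guided.shape)  # torch.Size([1, 128, 256])
```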
Problem

Research questions and friction points this paper is trying to address.

Controllable text-to-audio generation with precise temporal constraints
Balancing accurate temporal localization, open-vocabulary scalability, and practical inference efficiency
Generating audio with structured event relationships and temporal attributes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic event graph-guided diffusion transformer framework
Quality-balanced data selection with hierarchical annotation
Consensus preference optimization using multiple reward signals (sketched below)
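A hedged sketch of what consensus among multiple reward signals could look like: a preference pair is kept for training only when every reward model ranks the two candidate audios the same way, and kept pairs are scored with a DPO-style loss. The reward choices, unanimity rule, and loss form are assumptions, not the paper's exact recipe.

```python
# Hedged sketch of consensus preference optimization: unanimous reward
# agreement gates which preference pairs feed a DPO-style objective.
import torch
import torch.nn.functional as F

def consensus_mask(rewards_a: torch.Tensor, rewards_b: torch.Tensor):
    """rewards_*: (num_rewards, batch) scores for two candidate audios.
    A pair counts only if all rewards unanimously prefer the same side."""
    prefer_a = (rewards_a > rewards_b).all(dim=0)
    prefer_b = (rewards_b > rewards_a).all(dim=0)
    return prefer_a, prefer_b

def dpo_loss(logp_w, logp_l, ref_w, ref_l, beta=0.1):
    """DPO objective on winner/loser log-probs under policy and reference."""
    margin = beta * ((logp_w - ref_w) - (logp_l - ref_l))
    return -F.logsigmoid(margin).mean()

# Toy example: 3 reward signals (say, timing accuracy, audio quality, and
# text-audio alignment) scoring 4 candidate pairs.
torch.manual_seed(0)
ra, rb = torch.randn(3, 4), torch.randn(3, 4)
a_wins, b_wins = consensus_mask(ra, rb)
print("unanimous A-wins:", a_wins.tolist(), "| B-wins:", b_wins.tolist())

# Only unanimous pairs contribute to the preference loss.
logp_a, logp_b = torch.randn(4, requires_grad=True), torch.randn(4)
ref_a, ref_b = torch.randn(4), torch.randn(4)
if a_wins.any():
    loss = dpo_loss(logp_a[a_wins], logp_b[a_wins],
                    ref_a[a_wins], ref_b[a_wins])
    print("consensus DPO loss:", float(loss))
```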
👥 Authors
Yisu Liu
Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100085, China
Chenxing Li
Tencent AI Lab, Beijing 100089, China
Wanqian Zhang
Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100085, China
Wenfu Wang
Tencent AI Lab, Beijing 100089, China
Meng Yu
Tencent AI Lab, Bellevue, WA 98004, USA
Ruibo Fu
Associate Professor, CASIA
AIGC · LMM · Intelligent speech interaction · Deepfake detection
Zheng Lin
Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100085, China
Weiping Wang
School of Information Science and Engineering, Central South University
Computer Network · Network Security
Dong Yu
Tencent AI Lab, Bellevue, WA 98004, USA