Investigating Group Relative Policy Optimization for Diffusion Transformer based Text-to-Audio Generation

📅 2026-03-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the persistent challenges in text-to-audio (T2A) generation, particularly in handling complex prompts and achieving precise semantic alignment. The authors propose a novel paradigm that leverages a large language model to produce high-fidelity audio descriptions and, for the first time, applies Group Relative Policy Optimization (GRPO) to fine-tune a diffusion-based Transformer T2A model via reinforcement learning. By systematically exploring combinations of multidimensional reward functions—including CLAP similarity, KL divergence, and Fréchet Audio Distance (FAD)—the approach significantly enhances both audio fidelity and adherence to input text. Experimental results demonstrate substantial improvements over existing methods on complex prompts, underscoring the effectiveness and potential of reinforcement learning in advancing T2A generation.
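The core of GRPO is that, instead of learning a separate value function, it samples a group of outputs per prompt and normalizes each sample's reward against the group's statistics to obtain advantages. A minimal sketch of that group-relative advantage computation (the function name and the normalization epsilon are illustrative, not from the paper):

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages as used in GRPO-style training:
    each sampled output's reward is normalized by the mean and
    standard deviation of its group (sketch, not the paper's code)."""
    rewards = np.asarray(rewards, dtype=float)
    mean, std = rewards.mean(), rewards.std()
    # Epsilon avoids division by zero when all rewards in a group tie.
    return (rewards - mean) / (std + 1e-8)
```

Samples scoring above the group mean receive positive advantages and are reinforced; those below are discouraged, so no learned critic is needed.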

📝 Abstract
Text-to-audio (T2A) generation has advanced considerably in recent years, yet existing methods continue to face challenges in accurately rendering complex text prompts, particularly those involving intricate audio effects, and achieving precise text-audio alignment. While prior approaches have explored data augmentation, explicit timing conditioning, and reinforcement learning, overall synthesis quality remains constrained. In this work, we experiment with reinforcement learning to further enhance T2A generation quality, building on diffusion transformer (DiT)-based architectures. Our method first employs a large language model (LLM) to generate high-fidelity, richly detailed audio captions, substantially improving text-audio semantic alignment, especially for ambiguous or underspecified prompts. We then apply Group Relative Policy Optimization (GRPO), a recently introduced reinforcement learning algorithm, to fine-tune the T2A model. Through systematic experimentation with diverse reward functions (including CLAP, KL, FAD, and their combinations), we identify the key drivers of effective RL in audio synthesis and analyze how reward design impacts final audio quality. Experimental results demonstrate that GRPO-based fine-tuning yields substantial gains in synthesis fidelity and prompt adherence.
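The abstract describes combining several reward signals (CLAP similarity, KL divergence, FAD) into one scalar reward for RL fine-tuning. One plausible form is a weighted sum where alignment is rewarded and distribution-distance terms are penalized; the weights and function name below are hypothetical, not taken from the paper:

```python
def combined_reward(clap_sim, kl_div, fad, w_clap=1.0, w_kl=0.1, w_fad=0.1):
    """Illustrative combined reward for T2A RL fine-tuning.

    clap_sim: CLAP text-audio similarity (higher is better).
    kl_div:   KL divergence penalty (lower is better).
    fad:      Frechet Audio Distance (lower is better).
    Weights are placeholders; the paper explores such combinations
    empirically rather than fixing one formula.
    """
    return w_clap * clap_sim - w_kl * kl_div - w_fad * fad
```

How the three terms are weighted determines whether training favors prompt adherence (CLAP), stability relative to the base model (KL), or perceptual audio quality (FAD).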
Problem

Research questions and friction points this paper is trying to address.

text-to-audio generation
text-audio alignment
complex audio effects
synthesis fidelity
prompt adherence
Innovation

Methods, ideas, or system contributions that make the work stand out.

Group Relative Policy Optimization
Diffusion Transformer
Text-to-Audio Generation
Reinforcement Learning
Large Language Model