PromptReverb: Multimodal Room Impulse Response Generation Through Latent Rectified Flow Matching

📅 2025-10-25
🤖 AI Summary
To address the scarcity of full-band room impulse response (RIR) data and the inability of existing models to generate acoustically accurate RIRs from diverse input modalities, this paper proposes PromptReverb, a two-stage text-to-RIR generation framework. The method combines a variational autoencoder (VAE) that upsamples band-limited RIRs to full band (48 kHz) with a latent diffusion transformer, trained with rectified flow matching and conditioned on natural language, that synthesizes RIRs from text descriptions. The model achieves a mean RT60 error of 8.8%, compared to -37% for widely used baselines, and yields more realistic room-acoustic parameters along with better perceptual quality. Target applications include virtual reality, architectural acoustics, and audio production.

📝 Abstract
Room impulse response (RIR) generation remains a critical challenge for creating immersive virtual acoustic environments. Current methods suffer from two fundamental limitations: the scarcity of full-band RIR datasets and the inability of existing models to generate acoustically accurate responses from diverse input modalities. We present PromptReverb, a two-stage generative framework that addresses these challenges. Our approach combines a variational autoencoder that upsamples band-limited RIRs to full-band quality (48 kHz), and a conditional diffusion transformer model based on rectified flow matching that generates RIRs from descriptions in natural language. Empirical evaluation demonstrates that PromptReverb produces RIRs with superior perceptual quality and acoustic accuracy compared to existing methods, achieving 8.8% mean RT60 error compared to -37% for widely used baselines and yielding more realistic room-acoustic parameters. Our method enables practical applications in virtual reality, architectural acoustics, and audio production where flexible, high-quality RIR synthesis is essential.
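
The RT60 error quoted in the abstract compares the reverberation time of a generated RIR with that of a reference. The abstract does not say how RT60 is estimated or how the error is signed, so the following is only a minimal sketch under stated assumptions: RT60 is taken from a line fit to the Schroeder backward-integrated decay curve between -5 dB and -25 dB, and the error is the signed relative difference in percent. All function names, and the synthetic decaying-noise RIR, are hypothetical and used for illustration only.

```python
import numpy as np


def schroeder_decay_db(rir):
    """Backward-integrated energy decay curve in dB, normalized to 0 dB at t = 0."""
    energy = np.asarray(rir, dtype=float) ** 2
    edc = np.cumsum(energy[::-1])[::-1]
    edc = edc / edc[0]
    return 10.0 * np.log10(np.maximum(edc, 1e-12))


def estimate_rt60(rir, sample_rate, start_db=-5.0, end_db=-25.0):
    """Fit a line to the decay curve between start_db and end_db and
    extrapolate to 60 dB of decay (an assumed, T20-style estimate)."""
    edc_db = schroeder_decay_db(rir)
    t = np.arange(len(edc_db)) / sample_rate
    mask = (edc_db <= start_db) & (edc_db >= end_db)
    slope, _ = np.polyfit(t[mask], edc_db[mask], 1)  # dB per second, negative
    return -60.0 / slope


def signed_rt60_error_percent(rt60_generated, rt60_reference):
    """Assumed error definition: negative values mean the generated RIR decays too fast."""
    return 100.0 * (rt60_generated - rt60_reference) / rt60_reference


def synthetic_rir(rt60, sample_rate=48_000, duration=2.0, seed=0):
    """Exponentially decaying white noise, a toy stand-in for a measured RIR."""
    rng = np.random.default_rng(seed)
    t = np.arange(int(duration * sample_rate)) / sample_rate
    envelope = 10.0 ** (-3.0 * t / rt60)  # energy falls by 60 dB at t = rt60
    return rng.standard_normal(t.size) * envelope


if __name__ == "__main__":
    reference = synthetic_rir(rt60=0.5)
    generated = synthetic_rir(rt60=0.4, seed=1)
    rt_ref = estimate_rt60(reference, 48_000)
    rt_gen = estimate_rt60(generated, 48_000)
    print(f"reference RT60: {rt_ref:.3f} s, generated RT60: {rt_gen:.3f} s")
    print(f"signed error: {signed_rt60_error_percent(rt_gen, rt_ref):.1f} %")
```

Under this signed convention, a negative figure such as the -37% reported for baselines would correspond to reverberation times that are too short on average, though the abstract itself does not confirm that reading.
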
Problem

Research questions and friction points this paper is trying to address.

Generating full-band room impulse responses from limited datasets
Creating acoustically accurate RIRs from diverse input modalities
Improving RIR quality for virtual reality and audio applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Upsamples band-limited RIRs using variational autoencoder
Generates RIRs from text via conditional diffusion transformer
Employs rectified flow matching for acoustic accuracy
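
The rectified flow matching objective named above can be sketched in a few lines. This is a generic illustration of the technique, not the authors' implementation: it assumes the model predicts a velocity along the straight line between a noise sample and a data latent, and that sampling integrates that velocity with Euler steps. The names `predict_velocity`, `cond` (standing in for a text embedding), and the toy oracle are hypothetical; in the paper's setting the latents would come from the VAE and the predictor would be the diffusion transformer.

```python
import numpy as np


def interpolate(x0, x1, t):
    """Straight-line path from noise x0 (t = 0) to data latent x1 (t = 1)."""
    x_t = (1.0 - t) * x0 + t * x1
    target_velocity = x1 - x0  # constant along the straight path
    return x_t, target_velocity


def flow_matching_loss(predict_velocity, x1, cond, rng):
    """Mean squared error between predicted and target velocity at random times."""
    x0 = rng.standard_normal(x1.shape)
    t = rng.uniform(0.0, 1.0, size=(x1.shape[0], 1))
    x_t, target = interpolate(x0, x1, t)
    pred = predict_velocity(x_t, t, cond)
    return float(np.mean((pred - target) ** 2))


def euler_sample(predict_velocity, x0, cond, num_steps=8):
    """Integrate dx/dt = v(x, t, cond) from t = 0 to t = 1 with fixed Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = np.full((x.shape[0], 1), k * dt)
        x = x + dt * predict_velocity(x, t, cond)
    return x


def oracle_velocity(x_t, t, cond):
    """Toy predictor: treats cond itself as the target latent, so the exact
    straight-line velocity is known. A trained network would replace this."""
    return (cond - x_t) / (1.0 - t)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cond = rng.standard_normal((4, 16))  # hypothetical batch of 4 latents, 16 dims
    print("loss with oracle:", flow_matching_loss(oracle_velocity, cond, cond, rng))
    sample = euler_sample(oracle_velocity, rng.standard_normal((4, 16)), cond)
    print("max sampling error:", np.abs(sample - cond).max())
```

Because the target paths are straight, a well-trained velocity field can be integrated with comparatively few steps, which is one common motivation for choosing rectified flow over curved diffusion trajectories.
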