PromptReverb: Multimodal Room Impulse Response Generation Through Latent Rectified Flow Matching

📅 2025-10-25

🤖 AI Summary

To address the scarcity of full-band room impulse response (RIR) data and the inability of existing models to generate acoustically accurate RIRs from multimodal inputs, this paper proposes PromptReverb, a two-stage framework for generating full-band RIRs from text. The method combines a variational autoencoder (VAE) that upsamples band-limited RIRs to full-band quality with a diffusion transformer, trained by rectified flow matching in latent space and conditioned on natural language, that synthesizes full-band RIRs. It achieves a mean RT60 error of 8.8%, compared to -37% for widely used baselines, and produces RIRs with better perceptual quality and more realistic room-acoustic parameters than existing methods. The authors position the work as a step toward text-driven RIR synthesis for immersive virtual acoustic environments.

📝 Abstract

Room impulse response (RIR) generation remains a critical challenge for creating immersive virtual acoustic environments. Current methods suffer from two fundamental limitations: the scarcity of full-band RIR datasets and the inability of existing models to generate acoustically accurate responses from diverse input modalities. We present PromptReverb, a two-stage generative framework that addresses these challenges. Our approach combines a variational autoencoder that upsamples band-limited RIRs to full-band quality (48 kHz), and a conditional diffusion transformer model based on rectified flow matching that generates RIRs from descriptions in natural language. Empirical evaluation demonstrates that PromptReverb produces RIRs with superior perceptual quality and acoustic accuracy compared to existing methods, achieving 8.8% mean RT60 error compared to -37% for widely used baselines and yielding more realistic room-acoustic parameters. Our method enables practical applications in virtual reality, architectural acoustics, and audio production where flexible, high-quality RIR synthesis is essential.
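The RT60 error quoted above can be made concrete with a small sketch. The abstract does not say how RT60 is estimated or how the error is signed, so everything here is an assumption: the estimator uses Schroeder backward integration with a line fit between -5 dB and -25 dB, the error is a signed relative difference, and the helper names (`estimate_rt60`, `rt60_percent_error`, `synthetic_rir`) are hypothetical. The RIRs are toy signals (decaying noise), not outputs of PromptReverb.

```python
import numpy as np

def estimate_rt60(rir, fs, start_db=-5.0, end_db=-25.0):
    """Estimate RT60 from the Schroeder energy decay curve (assumed T20-style fit)."""
    energy = rir.astype(np.float64) ** 2
    # Schroeder backward integration: energy remaining after each sample.
    edc = np.cumsum(energy[::-1])[::-1]
    edc_db = 10.0 * np.log10(edc / edc[0] + 1e-30)
    # Fit a straight line to the decay between start_db and end_db.
    idx = np.where((edc_db <= start_db) & (edc_db >= end_db))[0]
    slope, _ = np.polyfit(idx / fs, edc_db[idx], 1)  # dB per second (negative)
    # Extrapolate the fitted slope to a 60 dB decay.
    return -60.0 / slope

def rt60_percent_error(rt60_pred, rt60_ref):
    """Signed relative error in percent; negative means the prediction is too short."""
    return 100.0 * (rt60_pred - rt60_ref) / rt60_ref

def synthetic_rir(rt60, fs, duration, seed=0):
    """Toy RIR: Gaussian noise whose amplitude drops 60 dB after rt60 seconds."""
    rng = np.random.default_rng(seed)
    t = np.arange(int(duration * fs)) / fs
    envelope = 10.0 ** (-3.0 * t / rt60)
    return rng.standard_normal(t.size) * envelope

fs = 48_000  # full-band sample rate mentioned in the abstract
reference = synthetic_rir(rt60=0.60, fs=fs, duration=1.5, seed=0)
generated = synthetic_rir(rt60=0.50, fs=fs, duration=1.5, seed=1)

ref_rt60 = estimate_rt60(reference, fs)
gen_rt60 = estimate_rt60(generated, fs)
error_pct = rt60_percent_error(gen_rt60, ref_rt60)
print(f"reference {ref_rt60:.3f} s, generated {gen_rt60:.3f} s, error {error_pct:.1f}%")
```

Under this signed convention, a negative figure such as the -37% reported for baselines would mean the generated reverberation times are shorter than the reference on average; whether the paper uses exactly this convention is not stated in the abstract.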

Problem

Research questions and friction points this paper is trying to address.

Generating full-band room impulse responses from limited datasets

Creating acoustically accurate RIRs from diverse input modalities

Improving RIR quality for virtual reality and audio applications

Innovation

Methods, ideas, or system contributions that make the work stand out.

Upsamples band-limited RIRs to full-band quality (48 kHz) using a variational autoencoder

Generates RIRs from text via a conditional diffusion transformer

Employs rectified flow matching for acoustic accuracy
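As a rough illustration of the rectified flow matching idea named in the list above, the sketch below shows the three generic ingredients: a straight-line interpolation between noise and data, the constant velocity target, and Euler integration at sampling time. It is not the paper's model. The text-conditioned diffusion transformer is replaced by a closed-form velocity field for a single target vector, and all names (`interpolate`, `velocity_target`, `euler_sample`, `oracle_velocity`, `latent_dim`) are hypothetical.

```python
import numpy as np

def interpolate(x0, x1, t):
    """Straight-line path from noise x0 (t=0) to data x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1

def velocity_target(x0, x1):
    """Regression target along the straight path: a constant velocity."""
    return x1 - x0

def flow_matching_loss(v_pred, x0, x1):
    """Mean squared error between a predicted velocity and the target."""
    return float(np.mean((v_pred - velocity_target(x0, x1)) ** 2))

def euler_sample(velocity_fn, x0, num_steps=8):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * velocity_fn(x, k * dt)
    return x

rng = np.random.default_rng(0)
latent_dim = 16                                  # hypothetical latent size
target_latent = rng.standard_normal(latent_dim)  # stand-in for a VAE-encoded RIR
noise = rng.standard_normal(latent_dim)

def oracle_velocity(x, t):
    # Closed-form field for a single target; a trained, text-conditioned
    # network would take this role in a real system.
    return (target_latent - x) / (1.0 - t)

sample = euler_sample(oracle_velocity, noise, num_steps=8)
x_mid = interpolate(noise, target_latent, 0.5)
loss_at_oracle = flow_matching_loss(oracle_velocity(x_mid, 0.5), noise, target_latent)
print(np.allclose(sample, target_latent), loss_at_oracle)
```

Because the path is a straight line, a handful of Euler steps already lands on the target in this toy setting. In a two-stage design like the one described, the sampled latent would then be decoded back to a waveform by the VAE decoder.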
