AI Summary
Existing RIR generation methods rely heavily on room geometry priors, limiting their applicability when the layout is unknown or when perceptual fidelity matters most, as in VR and audio post-production. This work introduces a framework for RIR generation conditioned solely on perceptual acoustic parameters (e.g., reverberation time, direct-to-reverberant ratio), eliminating the dependence on explicit geometric modeling. Four generative models operating in the Descript Audio Codec domain are compared: an autoregressive Transformer, MaskGIT, flow matching, and a classifier-guided approach, working on either discrete token sequences or continuous embeddings. Objective metrics and subjective MOS evaluations show that the proposed models match or outperform state-of-the-art alternatives, with the MaskGIT variant achieving the best performance.
Abstract
The generation of room impulse responses (RIRs) using deep neural networks has attracted growing research interest due to its applications in virtual and augmented reality, audio post-production, and related fields. Most existing approaches condition generative models on physical descriptions of a room, such as its size, shape, and surface materials. However, this reliance on geometric information limits their usability when the room layout is unknown or when perceptual realism (how a space sounds to a listener) matters more than strict physical accuracy. In this study, we propose an alternative strategy: conditioning RIR generation directly on a set of RIR acoustic parameters. These parameters include various measures of reverberation time and the direct-to-reverberant ratio, both broadband and bandwise. By specifying how the space should sound instead of how it should look, our method enables more flexible and perceptually driven RIR generation. We explore both autoregressive and non-autoregressive generative models operating in the Descript Audio Codec domain, using either discrete token sequences or continuous embeddings. Specifically, we evaluate four models: an autoregressive Transformer, a MaskGIT model, a flow matching model, and a classifier-based approach. Objective and subjective evaluations compare these methods with state-of-the-art alternatives. Results show that the proposed models match or outperform the state of the art, with the MaskGIT model achieving the best performance.
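To make the conditioning parameters concrete, the sketch below estimates two of them from a raw RIR: broadband reverberation time via Schroeder backward integration with a T30 line fit, and the direct-to-reverberant ratio using a short window around the direct-path peak. This is an illustrative implementation of the standard definitions, not the paper's code; the 2.5 ms direct window, the -5 to -35 dB fit range, and the synthetic test RIR are assumptions made here for demonstration.

```python
import numpy as np

def schroeder_edc(rir):
    """Energy decay curve in dB via backward integration of the squared RIR."""
    energy = np.cumsum(rir[::-1] ** 2)[::-1]
    return 10 * np.log10(energy / energy[0])

def rt60_from_t30(rir, fs):
    """Broadband RT60: line fit over the -5..-35 dB decay (T30), extrapolated to -60 dB."""
    edc = schroeder_edc(rir)
    idx = np.where((edc <= -5) & (edc >= -35))[0]
    slope, _ = np.polyfit(idx / fs, edc[idx], 1)  # decay rate in dB per second
    return -60.0 / slope

def drr_db(rir, fs, direct_ms=2.5):
    """DRR: energy in a short window around the peak vs. everything after it."""
    peak = int(np.argmax(np.abs(rir)))
    w = int(direct_ms * 1e-3 * fs)
    direct = np.sum(rir[max(0, peak - w):peak + w] ** 2)
    reverb = np.sum(rir[peak + w:] ** 2)
    return 10 * np.log10(direct / reverb)

# Synthetic check: exponentially decaying noise with a known RT60 of 0.5 s.
fs = 16000
rt60_true = 0.5
t = np.arange(int(fs * rt60_true * 1.5)) / fs
rng = np.random.default_rng(0)
rir = rng.standard_normal(t.size) * 10 ** (-3 * t / rt60_true)
rir[0] = 5.0  # strong direct-path impulse at the start

print(f"RT60 ~ {rt60_from_t30(rir, fs):.2f} s, DRR ~ {drr_db(rir, fs):.1f} dB")
```

The estimated RT60 should land close to the 0.5 s ground truth, since the amplitude envelope `10**(-3*t/rt60)` decays exactly 60 dB in energy per `rt60` seconds; the paper's bandwise variants would apply the same procedure per octave or third-octave band after filtering.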