Gencho: Room Impulse Response Generation from Reverberant Speech and Text via Diffusion Transformers

📅 2026-02-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes Gencho, a diffusion-transformer-based model for blind room impulse response (RIR) estimation from reverberant speech. It targets the limited modeling capacity of existing methods, their degraded performance under unseen conditions, and the need for more flexible RIR generation in generative audio applications. A structure-aware encoder exploits the separation between early and late reflections to encode the reverberant input into a robust conditioning representation, and a diffusion decoder predicts RIRs in complex spectrogram form. Gencho integrates modularly with standard speech processing pipelines for acoustic matching. It generates richer RIRs than non-generative baselines while maintaining strong performance on standard RIR metrics, and it extends to text-conditioned RIR generation for controllable acoustic simulation.

📝 Abstract
Blind room impulse response (RIR) estimation is a core task for capturing and transferring acoustic properties; yet existing methods often suffer from limited modeling capability and degraded performance under unseen conditions. Moreover, emerging generative audio applications call for more flexible impulse response generation methods. We propose Gencho, a diffusion-transformer-based model that predicts complex spectrogram RIRs from reverberant speech. A structure-aware encoder leverages isolation between early and late reflections to encode the input audio into a robust representation for conditioning, while the diffusion decoder generates diverse and perceptually realistic impulse responses from it. Gencho integrates modularly with standard speech processing pipelines for acoustic matching. Results show richer generated RIRs than non-generative baselines while maintaining strong performance in standard RIR metrics. We further demonstrate its application to text-conditioned RIR generation, highlighting Gencho's versatility for controllable acoustic simulation and generative audio tasks.
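Two ideas in the abstract can be illustrated with a small NumPy sketch: splitting an RIR into early and late reflections, and using an RIR for acoustic matching by convolving it with dry speech. This is not Gencho's encoder or pipeline. The helper names (`split_early_late`, `apply_rir`, `synthetic_rir`), the 50 ms early/late boundary, and the exponentially decaying noise used as a stand-in RIR are all assumptions made for illustration.

```python
import numpy as np


def split_early_late(rir, sample_rate, early_ms=50.0):
    """Split an RIR into early and late parts around the direct-path peak.

    The 50 ms boundary is a common convention, assumed here; it is not a
    value taken from the paper.
    """
    rir = np.asarray(rir, dtype=np.float64)
    direct = int(np.argmax(np.abs(rir)))  # index of the strongest (direct) arrival
    boundary = min(len(rir), direct + int(round(early_ms * 1e-3 * sample_rate)))
    early = np.zeros_like(rir)
    late = np.zeros_like(rir)
    early[:boundary] = rir[:boundary]
    late[boundary:] = rir[boundary:]
    return early, late


def apply_rir(dry, rir):
    """Reverberate a dry signal by linear convolution with an RIR."""
    return np.convolve(dry, rir, mode="full")


def synthetic_rir(sample_rate=16000, duration_s=0.5, rt60_s=0.3, seed=0):
    """Toy RIR: unit direct path followed by exponentially decaying noise."""
    rng = np.random.default_rng(seed)
    n = int(duration_s * sample_rate)
    t = np.arange(n) / sample_rate
    # Amplitude envelope that falls by 60 dB after rt60_s seconds (3 * ln(10) ~= 6.908).
    decay = np.exp(-3.0 * np.log(10.0) * t / rt60_s)
    rir = 0.1 * rng.standard_normal(n) * decay
    rir[0] = 1.0
    return rir


if __name__ == "__main__":
    sr = 16000
    rir = synthetic_rir(sample_rate=sr)
    early, late = split_early_late(rir, sr)
    dry = np.random.default_rng(1).standard_normal(sr)  # placeholder for dry speech
    wet = apply_rir(dry, rir)
    print(len(wet), float(np.sum(early**2) / np.sum(rir**2)))
```

Because convolution is linear, reverberating the dry signal with the early and late parts separately and summing the results reproduces the fully reverberated signal. This is why the two components can be treated as separable pieces of the same acoustic description.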
Problem

Research questions and friction points this paper is trying to address.

Room Impulse Response
Blind Estimation
Generative Audio
Acoustic Simulation
Reverberant Speech
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion Transformer
Room Impulse Response (RIR) Generation
Reverberant Speech
Text-Conditioned Audio Synthesis
Structure-Aware Encoding