🤖 AI Summary
This work addresses the challenge of reconstructing full-sphere head-related impulse responses (HRIRs) from sparse individualized measurements by proposing the first time-domain, end-to-end, grid-free binaural Transformer model. Departing from conventional frequency-domain approaches that rely on minimum-phase assumptions and fixed directional grids, the proposed method integrates sinusoidal spatial encoding, a Conv1D refinement module, and multi-task auxiliary heads for interaural time and level differences (ITD/ILD) to directly predict complete HRIRs at arbitrary directions from limited measurements. Evaluated on the SONICOM dataset, the model significantly outperforms existing methods in terms of normalized mean squared error (NMSE), cosine similarity, and ITD/ILD estimation accuracy, while simultaneously enhancing temporal fidelity and spatial continuity.
📝 Abstract
Individualized head-related impulse responses (HRIRs) enable binaural rendering, but dense per-listener measurements are costly. We address HRIR spatial up-sampling from sparse per-listener measurements: given a few measured HRIRs for a listener, predict HRIRs at unmeasured target directions. Prior learning methods often work in the frequency domain, rely on minimum-phase assumptions or separate timing models, and use a fixed direction grid, which can degrade temporal fidelity and spatial continuity. We propose BiFormer3D, a time-domain, grid-free binaural Transformer for reconstructing HRIRs at arbitrary directions from sparse inputs. It uses sinusoidal spatial features, a Conv1D refinement module, and auxiliary interaural time difference (ITD) and interaural level difference (ILD) heads. On SONICOM, it improves normalized mean squared error (NMSE), cosine distance, and ITD/ILD errors over prior methods; ablations validate modules and show minimum-phase pre-processing is unnecessary.