🤖 AI Summary
Consumer-grade videos commonly lack binaural audio, limiting immersive spatial auditory experiences. This work proposes a vision-guided framework for monaural-to-binaural audio reconstruction that explicitly predicts the left and right channels from visual cues. The method introduces a dual-head self-attention mechanism that jointly produces a shared scene map and end-to-end left/right channel attention, complemented by an annealed soft spatial prior and a two-stage, confidence-weighted waveform-domain fusion strategy, eliminating the need for handcrafted masks or task-specific annotations. Built on a ViT encoder with multi-crop window aggregation, the approach achieves consistent improvements in time-frequency and phase-sensitive metrics with competitive signal-to-noise ratio on the FAIR-Play and MUSIC-Stereo datasets, while effectively suppressing inter-channel crosstalk.
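To make the annealed soft spatial prior concrete, here is a minimal PyTorch sketch of one plausible realization: an additive left/right attention bias over ViT patch positions that decays linearly over training. The function name, the linear schedule, and the patch-coordinate bias are illustrative assumptions, not the paper's stated implementation.

```python
import torch

def annealed_spatial_prior(h_patches, w_patches, step, total_steps, max_bias=1.0):
    """Hypothetical soft left/right bias over ViT patch positions, annealed to zero.

    Returns a (2, h_patches * w_patches) tensor of additive attention biases:
    row 0 (left-channel head) favors left-half patches, row 1 (right-channel
    head) favors right-half patches. The bias shrinks linearly with training step.
    """
    # Linear annealing: full bias at step 0, no bias at total_steps.
    scale = max_bias * max(0.0, 1.0 - step / total_steps)
    # Horizontal coordinate of each patch in [-1, 1] (left = -1, right = +1),
    # assuming row-major patch ordering.
    x = torch.linspace(-1.0, 1.0, w_patches).repeat(h_patches)
    left_bias = -scale * x    # boosts attention toward left-half patches
    right_bias = scale * x    # boosts attention toward right-half patches
    return torch.stack([left_bias, right_bias], dim=0)
```

In this reading, the bias is simply added to the pre-softmax attention logits of the two channel heads early in training and fades out as the heads learn their own grounding.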
📝 Abstract
Binaural audio delivers spatial cues essential for immersion, yet most consumer videos are monaural due to capture constraints. We introduce SIREN, a visually guided mono-to-binaural framework that explicitly predicts left and right channels. A ViT-based encoder learns dual-head self-attention to produce a shared scene map and end-to-end L/R attention, replacing hand-crafted masks. A soft, annealed spatial prior gently biases early L/R grounding, and a two-stage, confidence-weighted waveform-domain fusion (guided by mono reconstruction and interaural phase consistency) suppresses crosstalk when aggregating multi-crop and overlapping windows. Evaluated on FAIR-Play and MUSIC-Stereo, SIREN yields consistent gains on time-frequency and phase-sensitive metrics with competitive SNR. The design is modular and generic, requires no task-specific annotations, and integrates with standard audio-visual pipelines.
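As an illustration of confidence-weighted waveform-domain fusion over overlapping windows, the sketch below weights each window's L/R prediction by how well its mono reconstruction (L+R)/2 matches the mono input, then overlap-adds. The function name, the exponential error-to-confidence mapping, and the single-stage overlap-add are assumptions standing in for the paper's two-stage, phase-consistency-guided procedure.

```python
import torch
import torch.nn.functional as F

def confidence_weighted_fusion(window_preds, window_mono, starts, total_len):
    """Hypothetical fusion of overlapping per-window binaural predictions.

    window_preds: list of (2, W) tensors -- predicted L/R waveforms per window.
    window_mono:  list of (W,)  tensors -- the mono input for each window.
    starts:       list of sample offsets where each window begins.
    total_len:    length of the fused output in samples.
    Confidence is a proxy for mono-reconstruction consistency: windows whose
    (L+R)/2 better matches the mono input contribute more to the overlap-add.
    """
    out = torch.zeros(2, total_len)
    weight = torch.zeros(1, total_len)
    for pred, mono, start in zip(window_preds, window_mono, starts):
        recon = pred.mean(dim=0)            # mono reconstruction from predicted L/R
        err = F.mse_loss(recon, mono)       # lower error -> higher confidence
        conf = torch.exp(-err)              # map error to a weight in (0, 1]
        length = pred.shape[-1]
        out[:, start:start + length] += conf * pred
        weight[:, start:start + length] += conf
    return out / weight.clamp_min(1e-8)    # normalize by accumulated confidence
```

Under this reading, low-confidence windows (e.g. those whose channel predictions drift from the mono signal) are down-weighted at the seams, which is one way crosstalk can be suppressed during multi-crop and overlapping-window aggregation.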