🤖 AI Summary
Consumer-grade videos commonly lack binaural audio, limiting immersive spatial auditory experiences. This work proposes a vision-guided framework for monaural-to-binaural audio reconstruction that explicitly predicts the left and right channels from visual cues. The method introduces a dual-head self-attention mechanism that jointly produces a shared scene map and end-to-end left/right channel attention, complemented by an annealed soft spatial prior and a two-stage, confidence-weighted waveform-domain fusion strategy, eliminating the need for handcrafted masks or task-specific annotations. Built on a ViT encoder with multi-crop window aggregation, the approach achieves consistent improvements in time-frequency and phase-sensitive metrics with competitive signal-to-noise ratio on the FAIR-Play and MUSIC-Stereo datasets, while effectively suppressing inter-channel crosstalk.
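To make the annealed soft spatial prior concrete, here is a minimal PyTorch sketch of one plausible realization: an additive left/right attention bias over ViT patch positions that decays linearly over training. The function name, the linear schedule, and the patch-coordinate bias are illustrative assumptions, not the paper's stated implementation.

```python
import torch

def annealed_spatial_prior(h_patches, w_patches, step, total_steps, max_bias=1.0):
    """Hypothetical soft left/right bias over ViT patch positions, annealed to zero.

    Returns a (2, h_patches * w_patches) tensor of additive attention biases:
    row 0 (left-channel head) favors left-half patches, row 1 (right-channel
    head) favors right-half patches. The bias shrinks linearly with training step.
    """
    # Linear annealing: full bias at step 0, no bias at total_steps.
    scale = max_bias * max(0.0, 1.0 - step / total_steps)
    # Horizontal coordinate of each patch in [-1, 1] (left = -1, right = +1),
    # assuming row-major patch ordering.
    x = torch.linspace(-1.0, 1.0, w_patches).repeat(h_patches)
    left_bias = -scale * x    # boosts attention toward left-half patches
    right_bias = scale * x    # boosts attention toward right-half patches
    return torch.stack([left_bias, right_bias], dim=0)
```

In this reading, the bias is simply added to the pre-softmax attention logits of the two channel heads early in training and fades out as the heads learn their own grounding.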
📝 Abstract
Binaural audio delivers spatial cues essential for immersion, yet most consumer videos are monaural due to capture constraints. We introduce SIREN, a visually guided mono-to-binaural framework that explicitly predicts left and right channels. A ViT-based encoder learns dual-head self-attention to produce a shared scene map and end-to-end L/R attention, replacing hand-crafted masks. A soft, annealed spatial prior gently biases early L/R grounding, and a two-stage, confidence-weighted waveform-domain fusion (guided by mono reconstruction and interaural phase consistency) suppresses crosstalk when aggregating multi-crop and overlapping windows. Evaluated on FAIR-Play and MUSIC-Stereo, SIREN yields consistent gains on time-frequency and phase-sensitive metrics with competitive SNR. The design is modular and generic, requires no task-specific annotations, and integrates with standard audio-visual pipelines.
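As an illustration of confidence-weighted waveform-domain fusion over overlapping windows, the sketch below weights each window's L/R prediction by how well its mono reconstruction (L+R)/2 matches the mono input, then overlap-adds. The function name, the exponential error-to-confidence mapping, and the single-stage overlap-add are assumptions standing in for the paper's two-stage, phase-consistency-guided procedure.

```python
import torch
import torch.nn.functional as F

def confidence_weighted_fusion(window_preds, window_mono, starts, total_len):
    """Hypothetical fusion of overlapping per-window binaural predictions.

    window_preds: list of (2, W) tensors -- predicted L/R waveforms per window.
    window_mono:  list of (W,)  tensors -- the mono input for each window.
    starts:       list of sample offsets where each window begins.
    total_len:    length of the fused output in samples.
    Confidence is a proxy for mono-reconstruction consistency: windows whose
    (L+R)/2 better matches the mono input contribute more to the overlap-add.
    """
    out = torch.zeros(2, total_len)
    weight = torch.zeros(1, total_len)
    for pred, mono, start in zip(window_preds, window_mono, starts):
        recon = pred.mean(dim=0)            # mono reconstruction from predicted L/R
        err = F.mse_loss(recon, mono)       # lower error -> higher confidence
        conf = torch.exp(-err)              # map error to a weight in (0, 1]
        length = pred.shape[-1]
        out[:, start:start + length] += conf * pred
        weight[:, start:start + length] += conf
    return out / weight.clamp_min(1e-8)    # normalize by accumulated confidence
```

Under this reading, low-confidence windows (e.g. those whose channel predictions drift from the mono signal) are down-weighted at the seams, which is one way crosstalk can be suppressed during multi-crop and overlapping-window aggregation.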