🤖 AI Summary
In general sound separation, a fundamental misalignment exists between signal-level optimization and semantic-level interference suppression, hindering models’ ability to suppress perceptually salient interference from acoustically similar sources. To address this, we propose a reinforcement learning–based multimodal semantic alignment framework: (1) modeling separation as a sequential decision-making task with a factorized Beta mask and group-wise relative advantage normalization; (2) introducing an audio-text-vision joint encoder to construct reliable multimodal rewards, enhanced via progressive alignment fine-tuning to improve cross-modal discriminability; and (3) incorporating clipped trust-region optimization, importance sampling, and entropy regularization. Evaluated across multiple benchmarks, our method consistently outperforms state-of-the-art approaches under text-, audio-, and image-based queries, achieving simultaneous improvements in both signal fidelity (SI-SNRi) and semantic quality (human evaluation and CLAP score).
📝 Abstract
Universal sound separation faces a fundamental misalignment: models optimized for low-level signal metrics often produce semantically contaminated outputs, failing to suppress perceptually salient interference from acoustically similar sources. To bridge this gap, we introduce MARS-Sep, a reinforcement learning framework that reformulates separation as decision making. Instead of simply regressing ground-truth masks, MARS-Sep learns a factorized Beta mask policy that is optimized by a clipped trust-region surrogate with entropy regularization and group-relative advantage normalization. Concretely, we sample masks from a frozen old policy, reconstruct waveforms, and update the current policy using clipped importance ratios, yielding substantially more stable and sample-efficient learning. Multimodal rewards, derived from an audio-text-vision encoder, directly incentivize semantic consistency with query prompts. We further propose a progressive alignment scheme to fine-tune this encoder, boosting its cross-modal discriminability and improving reward faithfulness. Extensive experiments on multiple benchmarks demonstrate consistent gains in Text-, Audio-, and Image-Queried separation, with notable improvements in signal metrics and semantic quality. Our code is available at https://anonymous.4open.science/r/MARS-Sep. Sound separation samples are available at https://mars-sep.github.io/.
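The update described in the abstract (sample masks from a frozen old policy, score them, normalize advantages within the group, and apply a clipped importance-ratio surrogate) can be illustrated with a minimal sketch. This is not the MARS-Sep implementation: all function names, the Beta parameters, the group size, and the clip range of 0.2 are hypothetical, the reward is a toy distance to a fixed target mask standing in for the multimodal encoder reward, and the waveform reconstruction and entropy regularization terms are omitted for brevity.

```python
import math
import random


def beta_log_prob(x, a, b):
    """Log-density of Beta(a, b) at x in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1.0) * math.log(x) + (b - 1.0) * math.log(1.0 - x)


def mask_log_prob(mask, alphas, betas):
    """Factorized policy: the mask log-prob is a sum over independent bins."""
    return sum(beta_log_prob(m, a, b) for m, a, b in zip(mask, alphas, betas))


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards by the mean and std of their own sample group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]


def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean of min(ratio * A, clip(ratio) * A); this quantity is maximized."""
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # importance ratio pi_new / pi_old
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


def demo_step(seed=0, group_size=4, n_bins=8):
    """One surrogate evaluation on toy data (all values are assumptions)."""
    rng = random.Random(seed)
    old_a, old_b = [2.0] * n_bins, [2.0] * n_bins  # frozen old policy
    new_a, new_b = [2.2] * n_bins, [1.9] * n_bins  # current policy
    target = [0.8] * n_bins  # toy stand-in for a semantic reward signal
    # Sample a group of masks from the old policy; clamp away from 0 and 1
    # so the log-density stays finite.
    masks = [
        [min(max(rng.betavariate(a, b), 1e-6), 1.0 - 1e-6)
         for a, b in zip(old_a, old_b)]
        for _ in range(group_size)
    ]
    rewards = [
        -sum((m - t) ** 2 for m, t in zip(mask, target)) / n_bins
        for mask in masks
    ]
    advantages = group_relative_advantages(rewards)
    logp_old = [mask_log_prob(m, old_a, old_b) for m in masks]
    logp_new = [mask_log_prob(m, new_a, new_b) for m in masks]
    return clipped_surrogate(logp_new, logp_old, advantages)
```

In a real training loop the surrogate would be differentiated with respect to the current policy's parameters, typically with an automatic differentiation framework; the sketch only evaluates it to show how the pieces fit together.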