🤖 AI Summary
This work proposes an efficient approach to audio-guided video object segmentation that eliminates the need for audio-specific training. By leveraging automatic speech recognition (ASR) to transcribe audio into text, the method repurposes pretrained text-guided video segmentation models to map audio queries to pixel-level masks. A key component is an existence-aware gating mechanism that suppresses hallucinated segmentations when the target object is absent from the video, substantially improving robustness. On the MeViS-Audio track of the 5th PVUW Challenge, the method achieved third place, demonstrating strong generalization and reliable audio-guided segmentation performance.
📝 Abstract
Audio-based Referring Video Object Segmentation (ARVOS) requires grounding audio queries to pixel-level object masks over time, posing the challenge of bridging acoustic signals with spatio-temporal visual representations. In this report, we present VIRST-Audio, a practical framework built upon a pretrained RVOS model integrated with a vision-language architecture. Instead of relying on audio-specific training, we convert the input audio into text with an ASR module and use the transcript as the referring query for the text-guided segmentation model, enabling effective transfer from text-based reasoning to audio-driven scenarios. To improve robustness, we further incorporate an existence-aware gating mechanism that estimates whether the referred target object is present in the video and suppresses predictions when it is absent, reducing hallucinated masks and stabilizing segmentation behavior. We evaluate our approach on the MeViS-Audio track of the 5th PVUW Challenge, where VIRST-Audio achieves 3rd place, demonstrating strong generalization and reliable performance in audio-based referring video segmentation.
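The pipeline described above (ASR transcription, text-guided segmentation, existence-aware gating) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `transcribe`, `presence_score`, and `segment_by_text` are hypothetical stand-ins for the real ASR module, existence estimator, and text-guided RVOS model, and the gate threshold is an assumed free parameter.

```python
def transcribe(audio):
    """ASR stub: convert an audio query into a text query.

    Stand-in for a real ASR model; here the transcript is supplied directly.
    """
    return audio["transcript"]


def presence_score(video, text_query):
    """Existence-aware gate stub: estimate the probability that the
    referred object actually appears in the video.

    A real estimator would score video features against the query; this
    stub just checks a precomputed object set for illustration.
    """
    return 0.9 if text_query in video["objects"] else 0.1


def segment_by_text(video, text_query):
    """Text-guided RVOS stub: return one mask per frame for the query."""
    return [f"mask[{text_query}]@frame{t}" for t in range(video["num_frames"])]


def audio_guided_segmentation(video, audio, gate_threshold=0.5):
    """Audio query -> ASR transcript -> text-guided masks, gated on presence.

    When the gate estimates the referred object is absent, empty (None)
    masks are emitted instead of hallucinated segmentations.
    """
    text_query = transcribe(audio)
    if presence_score(video, text_query) < gate_threshold:
        return [None] * video["num_frames"]  # gate closed: suppress predictions
    return segment_by_text(video, text_query)
```

The design point this sketch highlights is that no audio-specific training is needed: once audio is reduced to text, any pretrained text-guided RVOS model can serve as the segmentation backbone, and the gate acts as a cheap post-hoc filter on its outputs.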