3rd Place of MeViS-Audio Track of the 5th PVUW: VIRST-Audio

📅 2026-03-24
🤖 AI Summary
This work proposes an approach to audio-guided video object segmentation that requires no audio-specific training. An automatic speech recognition (ASR) module transcribes the audio query into text, so a pretrained text-guided video segmentation model can map audio inputs to pixel-level masks. An existence-aware gating mechanism estimates whether the referred object is present in the video and suppresses predictions when it is absent, which reduces hallucinated masks. On the MeViS-Audio track of the 5th PVUW Challenge, the method placed third.

📝 Abstract
Audio-based Referring Video Object Segmentation (ARVOS) requires grounding audio queries into pixel-level object masks over time, posing challenges in bridging acoustic signals with spatio-temporal visual representations. In this report, we present VIRST-Audio, a practical framework built upon a pretrained RVOS model integrated with a vision-language architecture. Instead of relying on audio-specific training, we convert input audio into text using an ASR module and perform segmentation using text-based supervision, enabling effective transfer from text-based reasoning to audio-driven scenarios. To improve robustness, we further incorporate an existence-aware gating mechanism that estimates whether the referred target object is present in the video and suppresses predictions when it is absent, reducing hallucinated masks and stabilizing segmentation behavior. We evaluate our approach on the MeViS-Audio track of the 5th PVUW Challenge, where VIRST-Audio achieves 3rd place, demonstrating strong generalization and reliable performance in audio-based referring video segmentation.
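The pipeline described in the abstract (transcribe the audio, query a text-guided segmentation model, then gate the output on an existence estimate) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `SegmentationOutput`, `segment_from_audio`, `transcribe`, `segment_with_text`, and `presence_threshold` are hypothetical, the 0.5 threshold is an arbitrary placeholder, and the report does not say how the existence score is computed. The ASR module and the RVOS model are passed in as plain callables so the sketch does not depend on any particular library.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence

# A binary mask for one frame: rows of 0/1 values (illustrative representation).
Mask = List[List[int]]


@dataclass
class SegmentationOutput:
    """Hypothetical output of a text-guided RVOS model."""
    masks: List[Mask]        # one mask per video frame
    presence_score: float    # estimated probability that the target exists


def empty_like(mask: Mask) -> Mask:
    """Return an all-zero mask with the same shape as `mask`."""
    return [[0] * len(row) for row in mask]


def segment_from_audio(
    audio: Any,
    frames: Sequence[Any],
    transcribe: Callable[[Any], str],
    segment_with_text: Callable[[Sequence[Any], str], SegmentationOutput],
    presence_threshold: float = 0.5,  # assumed value, not from the report
) -> List[Mask]:
    # Step 1: convert the audio query to text with an ASR module.
    text = transcribe(audio).strip()

    # Step 2: reuse a pretrained text-guided segmentation model unchanged.
    output = segment_with_text(frames, text)

    # Step 3: existence-aware gating. If the target is judged absent,
    # suppress every prediction by returning empty masks.
    if output.presence_score < presence_threshold:
        return [empty_like(mask) for mask in output.masks]
    return output.masks
```

In this sketch, gating is a hard decision over the whole video: a low existence score zeroes every frame's mask rather than shrinking individual masks. That is one simple way to realize "suppresses predictions when it is absent"; a per-frame or soft gate would be an equally plausible reading.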
Problem

Research questions and friction points this paper is trying to address.

Audio-based Referring Video Object Segmentation
ARVOS
audio-visual grounding
spatio-temporal visual representations
pixel-level object masks
Innovation

Methods, ideas, or system contributions that make the work stand out.

ASR-to-text transfer
existence-aware gating
vision-language architecture
zero-shot audio adaptation