🤖 AI Summary
To address shallow multimodal fusion and insufficient semantic understanding and reasoning in referring audio-visual segmentation (RAVS), this paper introduces the OmniAVS benchmark and the OISA model. OmniAVS is the first RAVS benchmark to support joint referring across four modalities—text, speech, sound, and vision—and covers eight categories of referring expressions, explicitly including reasoning-intensive samples that require world knowledge, which raises task difficulty and real-world applicability. OISA is an instruction-driven multimodal large language model (MLLM) that performs cross-modal alignment and context-aware parsing, augmented with fine-grained audio-visual feature modeling. Experiments show that OISA substantially improves over prior methods on OmniAVS and attains state-of-the-art or competitive performance not only on RAVS but also on related transfer tasks, validating its capacity for deep multimodal understanding and reasoning.
📝 Abstract
Referring audio-visual segmentation (RAVS) has recently seen significant advancements, yet challenges remain in integrating multimodal information and deeply understanding and reasoning about audio-visual content. To extend the boundaries of RAVS and facilitate future research in this field, we propose Omnimodal Referring Audio-Visual Segmentation (OmniAVS), a new dataset containing 2,098 videos and 59,458 multimodal referring expressions. OmniAVS stands out with three key innovations: (1) 8 types of multimodal expressions that flexibly combine text, speech, sound, and visual cues; (2) an emphasis on understanding audio content beyond merely detecting its presence; and (3) the inclusion of complex reasoning and world knowledge in expressions. Furthermore, we introduce the Omnimodal Instructed Segmentation Assistant (OISA) to address the challenges of multimodal reasoning and fine-grained understanding of audio-visual content in OmniAVS. OISA uses an MLLM to comprehend complex cues and perform reasoning-based segmentation. Extensive experiments show that OISA outperforms existing methods on OmniAVS and achieves competitive results on other related tasks.