Layover or Direct Flight: Rethinking Audio-Guided Image Segmentation

📅 2025-11-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses audio-guided image segmentation: localizing target objects in images directly from raw speech instructions, without relying on intermediate speech-to-text conversion. Methodologically, we propose an end-to-end audio-visual alignment paradigm and introduce a dedicated audio grounding dataset covering diverse object categories and human accents. We adopt an audio-visual contrastive learning framework to jointly optimize the speech and visual encoders, and benchmark the approach against models adapted from the audio-visual field. Experiments demonstrate that our method achieves accuracy comparable to or exceeding text-mediated baselines, with significantly improved robustness to accent variation and other forms of linguistic variability. These results support the feasibility and practicality of direct, text-free speech-vision alignment for grounded visual understanding.
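
The summary names an audio-visual contrastive objective but does not spell it out. Below is a minimal CLIP-style sketch, assuming a symmetric InfoNCE loss over in-batch pairs; the function name, temperature default, and batch-negative scheme are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def audio_visual_contrastive_loss(audio_emb, image_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired audio/image embeddings.

    audio_emb, image_emb: (batch, dim) tensors from the speech and image
    encoders; row i of each tensor forms a positive pair, and every other
    row in the batch acts as a negative.
    """
    # L2-normalize so dot products are cosine similarities.
    audio_emb = F.normalize(audio_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)

    # (batch, batch) similarity matrix; the diagonal holds the positives.
    logits = audio_emb @ image_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the audio-to-image and image-to-audio directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```

Under this kind of objective, both encoders are pulled toward a shared embedding space in which a spoken word and the image it describes score high cosine similarity.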

📝 Abstract
Understanding human instructions is essential for enabling smooth human-robot interaction. In this work, we focus on object grounding, i.e., localizing an object of interest in a visual scene (e.g., an image) based on verbal human instructions. Despite recent progress, a dominant research trend relies on using text as an intermediate representation. These approaches typically transcribe speech to text, extract relevant object keywords, and perform grounding using models pretrained on large text-vision datasets. However, we question both the efficiency and robustness of such transcription-based pipelines. Specifically, we ask: Can we achieve direct audio-visual alignment without relying on text? To explore this possibility, we simplify the task by focusing on grounding from single-word spoken instructions. We introduce a new audio-based grounding dataset that covers a wide variety of objects and diverse human accents. We then adapt and benchmark several models from the closely related audio-visual field. Our results demonstrate that direct grounding from audio is not only feasible but, in some cases, even outperforms transcription-based methods, especially in terms of robustness to linguistic variability. Our findings encourage a renewed interest in direct audio grounding and pave the way for more robust and efficient multimodal understanding systems.
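
The abstract frames the task as localizing the spoken object in an image. As a rough illustration of what direct audio grounding can look like at inference time, here is a minimal sketch, assuming the speech and image encoders already project into a shared embedding space; the function name, tensor shapes, and the fixed threshold are illustrative assumptions, not the paper's method.

```python
import torch
import torch.nn.functional as F

def ground_from_audio(audio_emb, feature_map, threshold=0.5):
    """Localize a spoken object as a similarity heatmap over the image.

    audio_emb:   (dim,) embedding of the single-word spoken instruction.
    feature_map: (dim, H, W) per-location visual features, assumed to live
                 in the same embedding space as the audio.
    """
    audio_emb = F.normalize(audio_emb, dim=0)
    feats = F.normalize(feature_map, dim=0)

    # Cosine similarity at every spatial location -> (H, W) heatmap.
    heatmap = torch.einsum("d,dhw->hw", audio_emb, feats)

    # Binarize the heatmap into a coarse localization mask.
    return heatmap, heatmap > threshold
```

For example, `ground_from_audio(torch.randn(512), torch.randn(512, 14, 14))` returns a 14x14 heatmap and mask; note there is no ASR or keyword-extraction step anywhere in the path, which is exactly what the transcription-based baselines add.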
Problem

Research questions and friction points this paper is trying to address.

Can direct audio-visual alignment be achieved without an intermediate text transcription step?
How can objects be grounded from single-word spoken instructions?
How robust is audio-based grounding to linguistic variability, such as diverse accents?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Direct audio-visual alignment without text transcription
New dataset with diverse accents for audio grounding
Benchmarking models adapted from the audio-visual field for robustness