AI Summary
Speech-driven open-vocabulary object detection aims to localize and identify unseen object categories directly from speech input, yet progress is hindered by the scarcity of paired audio-image data and by existing approaches' reliance on text-based intermediaries. This paper introduces Speech2See, an end-to-end framework that eliminates textual bridging and grounds speech directly in the visual scene. Its core contributions are: (1) a learnable query-guided semantic aggregation module that strengthens cross-modal alignment between speech and image features; and (2) a parameter-efficient Mixture-of-LoRA-Experts (MoLE) architecture that enhances generalization and adaptation. Adopting a pretrain-then-fine-tune paradigm, Speech2See achieves state-of-the-art performance across multiple benchmarks, with notable gains in robustness, cross-category generalization, and practical deployability.
Abstract
Audio grounding, or speech-driven open-set object detection, aims to localize and identify objects directly from speech, enabling generalization beyond predefined categories. This task is crucial for applications like human-robot interaction where textual input is impractical. However, progress in this domain faces a fundamental bottleneck from the scarcity of large-scale, paired audio-image data, and is further constrained by previous methods that rely on indirect, text-mediated pipelines. In this paper, we introduce Speech-to-See (Speech2See), an end-to-end approach built on a pre-training and fine-tuning paradigm. Specifically, in the pre-training stage, we design a Query-Guided Semantic Aggregation module that employs learnable queries to condense redundant speech embeddings into compact semantic representations. During fine-tuning, we incorporate a parameter-efficient Mixture-of-LoRA-Experts (MoLE) architecture to achieve deeper and more nuanced cross-modal adaptation. Extensive experiments show that Speech2See achieves robust and adaptable performance across multiple benchmarks, demonstrating its strong generalization ability and broad applicability.
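The abstract describes the Query-Guided Semantic Aggregation module as using learnable queries to condense redundant speech embeddings into compact semantic representations. The paper's implementation details are not given here; a minimal sketch of the underlying idea (cross-attention pooling with a fixed number of learnable queries, with all shapes, names, and the single-head formulation being illustrative assumptions) might look like:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def query_guided_aggregation(speech_emb, queries):
    """Condense a long speech-embedding sequence into a fixed number of
    compact semantic tokens via cross-attention from learnable queries.

    speech_emb: (T, d) frame-level speech embeddings (T is typically large)
    queries:    (n_q, d) learnable query vectors, n_q << T
    returns:    (n_q, d) compact semantic representation
    """
    d = queries.shape[-1]
    attn = softmax(queries @ speech_emb.T / np.sqrt(d))  # (n_q, T) weights
    return attn @ speech_emb                             # (n_q, d) pooled tokens

rng = np.random.default_rng(0)
speech_emb = rng.normal(size=(200, 64))  # e.g. T=200 speech frames, d=64
queries = rng.normal(size=(8, 64))       # 8 learnable queries (hypothetical count)
compact = query_guided_aggregation(speech_emb, queries)
print(compact.shape)  # (8, 64)
```

The key property is that the output size depends only on the number of queries, not on the utterance length, so variable-length, redundant speech sequences map to a fixed compact set of tokens for cross-modal alignment.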
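The fine-tuning stage uses a parameter-efficient Mixture-of-LoRA-Experts (MoLE) architecture. The paper's exact design is not reproduced here; as a rough sketch of the general MoLE pattern (a frozen base weight plus several low-rank LoRA experts combined by a learned gate; all dimensions, the gating scheme, and initialization choices are illustrative assumptions):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class MoLELayer:
    """Frozen base linear layer plus a gated mixture of low-rank (LoRA) experts."""
    def __init__(self, d_in, d_out, n_experts=4, rank=4, seed=0):
        rng = np.random.default_rng(seed)
        self.W0 = rng.normal(size=(d_out, d_in)) * 0.02          # frozen base weight
        self.A = rng.normal(size=(n_experts, rank, d_in)) * 0.02 # per-expert down-projection
        self.B = np.zeros((n_experts, d_out, rank))              # per-expert up-projection (zero-init)
        self.Wg = rng.normal(size=(n_experts, d_in)) * 0.02      # gating network

    def __call__(self, x):
        gate = softmax(self.Wg @ x)  # per-input mixture weights over experts
        # Sum the gated low-rank updates from each expert.
        delta = sum(g * (B @ (A @ x))
                    for g, A, B in zip(gate, self.A, self.B))
        return self.W0 @ x + delta

layer = MoLELayer(d_in=64, d_out=64)
x = np.ones(64)
y = layer(x)
print(y.shape)  # (64,)
```

Only the small `A`, `B`, and gating matrices would be trained, which is what makes the adaptation parameter-efficient; zero-initializing `B` is a common LoRA convention so that fine-tuning starts from the frozen pretrained behavior.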