Speech-to-See: End-to-End Speech-Driven Open-Set Object Detection

πŸ“… 2025-09-20
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Speech-driven open-set object detection aims to localize and identify unseen object categories directly from speech input, yet it remains hindered by the scarcity of paired audio–image data and by existing approaches' reliance on text-based intermediaries. This paper introduces Speech2See, an end-to-end framework that eliminates textual bridging and grounds speech directly in the visual scene. Its core contributions are: (1) a learnable query-guided semantic aggregation module that condenses redundant speech embeddings and strengthens cross-modal alignment between speech and image features; and (2) a parameter-efficient Mixture-of-LoRA-Experts (MoLE) architecture that enables deeper cross-modal adaptation. Adopting a pretrain-then-fine-tune paradigm, Speech2See achieves state-of-the-art performance across multiple benchmarks, with notable gains in robustness, cross-category generalization, and practical deployability.

πŸ“ Abstract
Audio grounding, or speech-driven open-set object detection, aims to localize and identify objects directly from speech, enabling generalization beyond predefined categories. This task is crucial for applications like human-robot interaction where textual input is impractical. However, progress in this domain faces a fundamental bottleneck from the scarcity of large-scale, paired audio-image data, and is further constrained by previous methods that rely on indirect, text-mediated pipelines. In this paper, we introduce Speech-to-See (Speech2See), an end-to-end approach built on a pre-training and fine-tuning paradigm. Specifically, in the pre-training stage, we design a Query-Guided Semantic Aggregation module that employs learnable queries to condense redundant speech embeddings into compact semantic representations. During fine-tuning, we incorporate a parameter-efficient Mixture-of-LoRA-Experts (MoLE) architecture to achieve deeper and more nuanced cross-modal adaptation. Extensive experiments show that Speech2See achieves robust and adaptable performance across multiple benchmarks, demonstrating its strong generalization ability and broad applicability.
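The Query-Guided Semantic Aggregation module described in the abstract can be pictured as cross-attention pooling: a small set of learnable queries attends over frame-level speech embeddings and condenses them into a few compact semantic tokens. Below is a minimal PyTorch sketch of that idea; the class name QueryGuidedAggregation, the feature dimension, and the number of queries are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of query-guided semantic aggregation (hypothetical design).
import torch
import torch.nn as nn

class QueryGuidedAggregation(nn.Module):
    """Condense a long sequence of speech embeddings into a few compact
    semantic tokens via learnable queries."""

    def __init__(self, dim: int = 256, num_queries: int = 8, num_heads: int = 8):
        super().__init__()
        # Learnable queries act as slots that pool semantics from speech frames.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, speech_emb: torch.Tensor) -> torch.Tensor:
        # speech_emb: (batch, seq_len, dim) frame-level speech features.
        b = speech_emb.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        # Queries attend over the (often redundant) speech frames and
        # aggregate them into num_queries compact semantic representations.
        pooled, _ = self.cross_attn(q, speech_emb, speech_emb)
        return self.norm(pooled)  # (batch, num_queries, dim)

if __name__ == "__main__":
    module = QueryGuidedAggregation()
    frames = torch.randn(2, 300, 256)  # e.g. 300 speech frames per clip
    print(module(frames).shape)        # torch.Size([2, 8, 256])
```

In a full system, such pooled tokens would play the role that text embeddings play in text-based open-set detectors, feeding the cross-modal grounding head directly instead of going through a transcription step.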
Problem

Research questions and friction points this paper is trying to address.

Localizing and identifying objects directly from speech input
Overcoming scarcity of large-scale paired audio-image datasets
Eliminating reliance on indirect text-mediated detection pipelines
Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-end speech-driven object detection pipeline
Query-guided semantic aggregation for speech embeddings
Mixture-of-LoRA-Experts for cross-modal adaptation (see the sketch after this list)
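A Mixture-of-LoRA-Experts layer can be sketched as several low-rank adapters wrapped around a frozen pretrained projection and blended by a learned router. The sketch below assumes a soft (softmax) token-wise router; the MoLELinear name, expert count, rank, and scaling are illustrative choices, not the paper's exact design.

```python
# Minimal sketch of a Mixture-of-LoRA-Experts linear layer (hypothetical design).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoLELinear(nn.Module):
    def __init__(self, in_dim: int, out_dim: int, num_experts: int = 4,
                 rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim)
        self.base.weight.requires_grad_(False)  # frozen pretrained weight
        self.base.bias.requires_grad_(False)
        self.router = nn.Linear(in_dim, num_experts)
        # Each expert is a LoRA pair: down-projection A and up-projection B.
        self.A = nn.Parameter(torch.randn(num_experts, in_dim, rank) * 0.02)
        self.B = nn.Parameter(torch.zeros(num_experts, rank, out_dim))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, in_dim)
        gates = F.softmax(self.router(x), dim=-1)          # (b, t, E)
        # Low-rank update from every expert: x @ A_e @ B_e.
        low = torch.einsum("btd,edr->bter", x, self.A)     # (b, t, E, r)
        upd = torch.einsum("bter,ero->bteo", low, self.B)  # (b, t, E, out)
        mixed = (gates.unsqueeze(-1) * upd).sum(dim=2)     # (b, t, out)
        return self.base(x) + self.scale * mixed

if __name__ == "__main__":
    layer = MoLELinear(256, 256)
    tokens = torch.randn(2, 8, 256)  # e.g. pooled speech tokens from above
    print(layer(tokens).shape)       # torch.Size([2, 8, 256])
```

Only the router and the LoRA factors are trainable here, which is what makes the fine-tuning stage parameter-efficient: the frozen base weights are shared, while each expert contributes a low-rank update that the router blends per token.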
πŸ‘₯ Authors
Wenhuan Lu - Tianjin University (Speech)
Xinyue Song - College of Intelligence and Computing, Tianjin University, Tianjin, China
Wenjun Ke - Southeast University (Natural Language Processing)
Zhizhi Yu - College of Intelligence and Computing, Tianjin University, Tianjin, China
Wenhao Yang - College of Intelligence and Computing, Tianjin University, Tianjin, China
Jianguo Wei - Tianjin University (Speech Production, Speech Processing, Artificial Medical Intelligence)