AI Summary
This work addresses a gap in existing multimodal benchmarks, which predominantly evaluate models on synchronized multimodal inputs and overlook a model's ability to actively retrieve cross-modal evidence starting from audio alone. We introduce the first fully multimodal deep search task that takes audio as the sole initial input, requiring models to invoke text, image, and video search tools based on auditory cues and to perform multi-hop reasoning to produce verifiable answers. A multi-stage filtering mechanism ensures that each sample depends strongly on the audio, requires retrieval, and admits a unique answer. Combining manual curation with automated pipelines, we construct a high-quality evaluation set of 640 samples spanning 15 fine-grained categories. Evaluation shows that even the strongest evaluated model, Gemini-3-Pro, achieves only 43.44% average accuracy, highlighting audio-based entity understanding and cross-modal retrieval as key bottlenecks.
Abstract
Current omni-modal benchmarks mainly evaluate models under settings where multiple modalities are provided simultaneously, while the ability to start from audio alone and actively search for cross-modal evidence remains underexplored. In this paper, we introduce \textbf{Omni-DeepSearch}, a benchmark for audio-driven omni-modal deep search. Given one or more audio clips and a related question, models must infer useful clues from audio, invoke text, image, and video search tools, and perform multi-hop reasoning to produce a short, objective, and verifiable answer. Omni-DeepSearch contains 640 samples across 15 fine-grained categories, covering four retrieval target modalities and four audio content types. A multi-stage filtering pipeline ensures audio dependence, retrieval necessity, visual modality necessity, and answer uniqueness. Experiments on recent closed-source and open-source omni-modal models show that this task remains highly challenging: the strongest evaluated model, Gemini-3-Pro, achieves only 43.44\% average accuracy. Further analyses illustrate key bottlenecks in audio entity inference, query formulation, tool-use reliability, multi-hop retrieval, and cross-modal verification. These results highlight audio-driven omni-modal deep search as an important and underexplored direction for future multimodal agents.
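The four filtering criteria named in the abstract (audio dependence, retrieval necessity, visual modality necessity, and answer uniqueness) can be read as successive predicates applied to each candidate sample. The following is a minimal sketch under that reading, not the authors' pipeline: the `Candidate` record, all of its field names, the stage order, and the assumption that visual necessity applies only to samples whose retrieval target is visual are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Sequence, Tuple


@dataclass(frozen=True)
class Candidate:
    """Hypothetical candidate sample; every field name here is an assumption."""
    sample_id: str
    answered_without_audio: bool          # solved from the question text alone
    answered_without_search: bool         # solved from audio + question, no tools
    targets_visual: bool                  # retrieval target involves image or video
    answered_with_text_search_only: bool  # solved without any visual search
    num_valid_answers: int                # distinct answers judged acceptable


Stage = Tuple[str, Callable[[Candidate], bool]]

# Each stage returns True when the candidate passes that check.
STAGES: Sequence[Stage] = (
    ("audio_dependence", lambda c: not c.answered_without_audio),
    ("retrieval_necessity", lambda c: not c.answered_without_search),
    # Assumed to apply only to visual-target samples.
    ("visual_necessity",
     lambda c: (not c.targets_visual) or (not c.answered_with_text_search_only)),
    ("answer_uniqueness", lambda c: c.num_valid_answers == 1),
)


def run_filters(
    candidates: Iterable[Candidate],
    stages: Sequence[Stage] = STAGES,
) -> Tuple[List[Candidate], Dict[str, int]]:
    """Keep candidates that pass every stage; count drops by first failing stage."""
    kept: List[Candidate] = []
    dropped: Dict[str, int] = {name: 0 for name, _ in stages}
    for cand in candidates:
        for name, passes in stages:
            if not passes(cand):
                dropped[name] += 1
                break
        else:
            # No stage rejected the candidate.
            kept.append(cand)
    return kept, dropped
```

Recording the first failing stage for each rejected candidate makes it easy to see which criterion removes the most samples, which is useful when tuning how candidates are collected.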