🤖 AI Summary
To address semantic misalignment in audio-to-image generation caused by homographs and auditory illusions, this paper proposes a dual-mechanism framework comprising an EXPrompt Mining stage and an EXPrompt Selector module. First, a large language model collaborates with an audio captioning model to expand weak class labels into diverse, semantically rich candidate prompts. A multimodal filtering and retrieval strategy then adaptively selects the optimal class-level and instance-level prompt for each audio sample. Finally, a lightweight mapping network, supervised by the selected prompts, adapts a pre-trained text-to-image diffusion model to audio input without fine-tuning the base generator. This approach improves cross-modal alignment while preserving generation efficiency. Experiments on multiple audio classification benchmarks demonstrate substantial gains in both semantic consistency and visual quality of generated images, effectively mitigating generation bias induced by audio ambiguity.
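To make the selection step concrete, the sketch below scores mined candidate prompts against an audio clip in a shared audio-text embedding space (a CLAP-style encoder is assumed) and keeps the best match. All names here (`select_prompt`, `encode_audio`, `encode_text`) are hypothetical placeholders for illustration, not the paper's actual API.

```python
# Minimal sketch of an EXPrompt-Selector-style retrieval step: given candidate
# prompts mined for a class, pick the one whose text embedding is closest to
# the audio clip's embedding in a shared audio-text space (e.g., CLAP).
import numpy as np

def select_prompt(audio_embedding: np.ndarray,
                  prompt_embeddings: np.ndarray,
                  prompts: list[str]) -> str:
    """Return the candidate prompt with highest cosine similarity to the audio."""
    a = audio_embedding / np.linalg.norm(audio_embedding)
    p = prompt_embeddings / np.linalg.norm(prompt_embeddings, axis=1, keepdims=True)
    scores = p @ a                      # cosine similarity, one score per prompt
    return prompts[int(np.argmax(scores))]

# Usage (embeddings would come from a pretrained audio-text encoder):
# prompts = ["a dog barking in a park", "a seal barking on coastal rocks", ...]
# best = select_prompt(clap.encode_audio(wav), clap.encode_text(prompts), prompts)
```

Because the similarity is computed per audio sample, the same class label ("bark") can retrieve different prompts for different clips, which is how instance-level ambiguity is resolved.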
📝 Abstract
We propose CatchPhrase, a novel audio-to-image generation framework designed to mitigate semantic misalignment between audio inputs and generated images. While recent advances in multimodal encoders have enabled progress in cross-modal generation, ambiguity stemming from homographs and auditory illusions continues to hinder accurate alignment. To address this issue, CatchPhrase generates enriched cross-modal semantic prompts (EXPrompt Mining) from weak class labels by leveraging large language models (LLMs) and audio captioning models (ACMs). To resolve both class-level and instance-level misalignment, we apply multimodal filtering and retrieval to select the most semantically aligned prompt for each audio sample (EXPrompt Selector). A lightweight mapping network is then trained to adapt pre-trained text-to-image generation models to audio input. Extensive experiments on multiple audio classification datasets demonstrate that CatchPhrase improves audio-to-image alignment and consistently enhances generation quality by mitigating semantic misalignment.
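As a concrete illustration of the adaptation step, here is a minimal PyTorch sketch of a lightweight mapping network of the kind the abstract describes: it projects a frozen audio encoder's embedding into pseudo text tokens that condition a frozen text-to-image diffusion model. The dimensions, token count, and two-layer design are assumptions for illustration, not the paper's exact architecture.

```python
# Hypothetical lightweight mapper: audio embedding -> pseudo text tokens in the
# conditioning space of a frozen text-to-image diffusion model.
import torch
import torch.nn as nn

class AudioToTextMapper(nn.Module):
    def __init__(self, audio_dim: int = 512, text_dim: int = 768, n_tokens: int = 8):
        super().__init__()
        self.n_tokens = n_tokens
        self.text_dim = text_dim
        self.net = nn.Sequential(
            nn.Linear(audio_dim, text_dim * n_tokens),
            nn.GELU(),
            nn.Linear(text_dim * n_tokens, text_dim * n_tokens),
        )

    def forward(self, audio_emb: torch.Tensor) -> torch.Tensor:
        # (batch, audio_dim) -> (batch, n_tokens, text_dim): a token sequence
        # consumed by the frozen diffusion model's cross-attention layers.
        out = self.net(audio_emb)
        return out.view(-1, self.n_tokens, self.text_dim)
```

Under this reading, training would minimize a distance between the mapper's output and the text encoder's embedding of the selected EXPrompt, with both the audio encoder and the text-to-image backbone kept frozen, so only the small mapper is updated.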