🤖 AI Summary
To address semantic misalignment in audio-to-image generation caused by homographs and auditory illusions, this paper proposes a dual-mechanism framework comprising an EXPrompt Mining stage and an EXPrompt Selector module. First, a large language model collaborates with an audio captioning model to expand weak class labels into diverse, semantically rich candidate prompts. A multimodal filtering and retrieval strategy then adaptively selects the optimal class-level and instance-level prompt for each audio sample. Finally, a lightweight mapping network, supervised by the selected prompts, adapts a pre-trained text-to-image diffusion model to audio input without fine-tuning the base generator. This approach improves cross-modal alignment while preserving generation efficiency. Experiments on multiple audio classification benchmarks demonstrate substantial gains in both semantic consistency and visual quality of generated images, effectively mitigating generation bias induced by audio ambiguity.
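To make the selection step concrete, the sketch below scores mined candidate prompts against an audio clip in a shared audio-text embedding space (a CLAP-style encoder is assumed) and keeps the best match. All names here (`select_prompt`, `encode_audio`, `encode_text`) are hypothetical placeholders for illustration, not the paper's actual API.

```python
# Minimal sketch of an EXPrompt-Selector-style retrieval step: given candidate
# prompts mined for a class, pick the one whose text embedding is closest to
# the audio clip's embedding in a shared audio-text space (e.g., CLAP).
import numpy as np

def select_prompt(audio_embedding: np.ndarray,
                  prompt_embeddings: np.ndarray,
                  prompts: list[str]) -> str:
    """Return the candidate prompt with highest cosine similarity to the audio."""
    a = audio_embedding / np.linalg.norm(audio_embedding)
    p = prompt_embeddings / np.linalg.norm(prompt_embeddings, axis=1, keepdims=True)
    scores = p @ a                      # cosine similarity, one score per prompt
    return prompts[int(np.argmax(scores))]

# Usage (embeddings would come from a pretrained audio-text encoder):
# prompts = ["a dog barking in a park", "a seal barking on coastal rocks", ...]
# best = select_prompt(clap.encode_audio(wav), clap.encode_text(prompts), prompts)
```

Because the similarity is computed per audio sample, the same class label ("bark") can retrieve different prompts for different clips, which is how instance-level ambiguity is resolved.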
📝 Abstract
We propose CatchPhrase, a novel audio-to-image generation framework designed to mitigate semantic misalignment between audio inputs and generated images. While recent advances in multimodal encoders have enabled progress in cross-modal generation, ambiguity stemming from homographs and auditory illusions continues to hinder accurate alignment. To address this issue, CatchPhrase generates enriched cross-modal semantic prompts (EXPrompt Mining) from weak class labels by leveraging large language models (LLMs) and audio captioning models (ACMs). To resolve both class-level and instance-level misalignment, we apply multimodal filtering and retrieval to select the most semantically aligned prompt for each audio sample (EXPrompt Selector). A lightweight mapping network is then trained to adapt pre-trained text-to-image generation models to audio input. Extensive experiments on multiple audio classification datasets demonstrate that CatchPhrase improves audio-to-image alignment and consistently enhances generation quality by mitigating semantic misalignment.
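As a concrete illustration of the adaptation step, here is a minimal PyTorch sketch of a lightweight mapping network of the kind the abstract describes: it projects a frozen audio encoder's embedding into pseudo text tokens that condition a frozen text-to-image diffusion model. The dimensions, token count, and two-layer design are assumptions for illustration, not the paper's exact architecture.

```python
# Hypothetical lightweight mapper: audio embedding -> pseudo text tokens in the
# conditioning space of a frozen text-to-image diffusion model.
import torch
import torch.nn as nn

class AudioToTextMapper(nn.Module):
    def __init__(self, audio_dim: int = 512, text_dim: int = 768, n_tokens: int = 8):
        super().__init__()
        self.n_tokens = n_tokens
        self.text_dim = text_dim
        self.net = nn.Sequential(
            nn.Linear(audio_dim, text_dim * n_tokens),
            nn.GELU(),
            nn.Linear(text_dim * n_tokens, text_dim * n_tokens),
        )

    def forward(self, audio_emb: torch.Tensor) -> torch.Tensor:
        # (batch, audio_dim) -> (batch, n_tokens, text_dim): a token sequence
        # consumed by the frozen diffusion model's cross-attention layers.
        out = self.net(audio_emb)
        return out.view(-1, self.n_tokens, self.text_dim)
```

Under this reading, training would minimize a distance between the mapper's output and the text encoder's embedding of the selected EXPrompt, with both the audio encoder and the text-to-image backbone kept frozen, so only the small mapper is updated.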