🤖 AI Summary
This study identifies a phoneme-level memorization vulnerability in lyrics-to-song (LS2) and text-to-video generative models: when presented with adversarial prompts built via homophonic substitution, which preserves acoustic structure while altering semantics, these models faithfully reconstruct audiovisual content from their training data. We introduce the concept of "phonetic-to-visual regurgitation," demonstrating for the first time that phonemic structure alone can trigger cross-modal reconstruction of full audiovisual sequences. Using Adversarial PhoneTic Prompting (APT), we systematically elicit high-fidelity reconstructions of training content from leading models, including SUNO, YuE, and Veo 3, and validate them with audio-similarity metrics (CLAP, AudioJudge, CoverID) across multiple languages and musical styles. Our findings expose a critical memorization-leakage risk in multimodal generative systems, with significant implications for copyright compliance, provenance tracing, and AI safety.
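For illustration, the sketch below shows one way a homophonic substitution could be generated automatically: swap a word for a different word whose phoneme sequence is close, so the semantics change while the acoustic structure survives. This is a minimal stand-in, not the paper's actual APT procedure; the use of the `pronouncing` CMUdict package and a simple sequence-similarity ranking are assumptions.

```python
# Minimal sketch of homophonic substitution (assumed procedure, not the
# paper's): rank other dictionary words by phoneme-sequence similarity.
from difflib import SequenceMatcher

import pronouncing  # pip install pronouncing  (CMU Pronouncing Dictionary)


def phonemes(word: str) -> list[str]:
    """Return the first CMUdict pronunciation, stress markers stripped."""
    prons = pronouncing.phones_for_word(word.lower())
    if not prons:
        return []
    return [p.rstrip("012") for p in prons[0].split()]


def near_homophones(word: str, top_k: int = 5) -> list[str]:
    """Rank rhyming words by phoneme-sequence overlap with `word`."""
    target = phonemes(word)
    if not target:
        return []
    scored = []
    # pronouncing.rhymes() returns CMUdict words sharing the rhyming part;
    # ranking them by phoneme overlap is a cheap proxy for acoustic closeness.
    for cand in pronouncing.rhymes(word):
        sim = SequenceMatcher(None, target, phonemes(cand)).ratio()
        scored.append((sim, cand))
    return [c for _, c in sorted(scored, reverse=True)[:top_k]]


# e.g. candidates for "spaghetti" include acoustically close words
# like "confetti", echoing the "mom's spaghetti" -> "Bob's confetti" example.
print(near_homophones("spaghetti"))
```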
📝 Abstract
Lyrics-to-Song (LS2) generation models promise end-to-end music synthesis from text, yet their vulnerability to training data memorization remains underexplored. We introduce Adversarial PhoneTic Prompting (APT), a novel attack where lyrics are semantically altered while preserving their acoustic structure through homophonic substitutions (e.g., Eminem's famous "mom's spaghetti" $\rightarrow$ "Bob's confetti"). Despite these distortions, we uncover a powerful form of sub-lexical memorization: models like SUNO and YuE regenerate outputs strikingly similar to known training content, achieving high similarity across audio-domain metrics, including CLAP, AudioJudge, and CoverID. This vulnerability persists across multiple languages and genres. More surprisingly, we discover that phoneme-altered lyrics alone can trigger visual memorization in text-to-video models. When prompted with phonetically modified lyrics from Lose Yourself, Veo 3 reconstructs visual elements from the original music video -- including character appearance and scene composition -- despite no visual cues in the prompt. We term this phenomenon phonetic-to-visual regurgitation. Together, these findings expose a critical vulnerability in transcript-conditioned multimodal generation: phonetic prompting alone can unlock memorized audiovisual content, raising urgent questions about copyright, safety, and content provenance in modern generative systems. Example generations are available on our demo page (jrohsc.github.io/music_attack/).
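As an illustration of the audio-side evaluation, the hedged sketch below computes a CLAP cosine similarity between a generated clip and a reference recording using LAION's `laion_clap` package. The exact checkpoint, preprocessing, and scoring protocol used in the paper are not reproduced here, and the file names are placeholders.

```python
# Minimal sketch of a CLAP-based similarity check (assumed setup): embed a
# generated clip and a reference recording, then compare by cosine similarity.
import numpy as np
import laion_clap  # pip install laion-clap

model = laion_clap.CLAP_Module(enable_fusion=False)
model.load_ckpt()  # downloads a default pretrained checkpoint

# One embedding vector per file; paths below are placeholders.
embs = model.get_audio_embedding_from_filelist(
    x=["generated_song.wav", "original_song.wav"], use_tensor=False
)
gen, ref = embs[0], embs[1]

# Cosine similarity near 1.0 means the generation closely matches the
# reference -- the kind of signal the paper reads as memorization.
score = float(np.dot(gen, ref) / (np.linalg.norm(gen) * np.linalg.norm(ref)))
print(f"CLAP similarity: {score:.3f}")
```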