Bob's Confetti: Phonetic Memorization Attacks in Music and Video Generation

📅 2025-07-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study identifies a phoneme-level memorization vulnerability in lyrics-to-song (LS2) and text-to-video generative models: when presented with adversarial prompts generated via homophonic substitution, which preserves acoustic structure while altering semantics, these models faithfully reconstruct audiovisual content from their training data. We introduce the concept of "phonetic-to-visual regurgitation," demonstrating for the first time that phonemic structure alone can trigger cross-modal reconstruction of full audiovisual sequences. Using Adversarial PhoneTic Prompting (APT), we systematically reproduce high-fidelity outputs across leading models, including SUNO, YuE, and Veo 3, validated through audio-specific metrics (CLAP, AudioJudge, CoverID) and multilingual, multi-style empirical evaluation. Our findings expose a critical memory-leakage risk in multimodal generative systems, with significant implications for copyright compliance, provenance tracing, and AI safety.

📝 Abstract
Lyrics-to-Song (LS2) generation models promise end-to-end music synthesis from text, yet their vulnerability to training data memorization remains underexplored. We introduce Adversarial PhoneTic Prompting (APT), a novel attack where lyrics are semantically altered while preserving their acoustic structure through homophonic substitutions (e.g., Eminem's famous "mom's spaghetti" → "Bob's confetti"). Despite these distortions, we uncover a powerful form of sub-lexical memorization: models like SUNO and YuE regenerate outputs strikingly similar to known training content, achieving high similarity across audio-domain metrics, including CLAP, AudioJudge, and CoverID. This vulnerability persists across multiple languages and genres. More surprisingly, we discover that phoneme-altered lyrics alone can trigger visual memorization in text-to-video models. When prompted with phonetically modified lyrics from Lose Yourself, Veo 3 reconstructs visual elements from the original music video -- including character appearance and scene composition -- despite no visual cues in the prompt. We term this phenomenon phonetic-to-visual regurgitation. Together, these findings expose a critical vulnerability in transcript-conditioned multimodal generation: phonetic prompting alone can unlock memorized audiovisual content, raising urgent questions about copyright, safety, and content provenance in modern generative systems. Example generations are available on our demo page (jrohsc.github.io/music_attack/).
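The core move in APT is homophonic substitution: swapping words for acoustically similar ones so the phoneme sequence survives while the meaning changes. The paper does not publish its substitution code; the sketch below is a minimal illustration of the idea, with a tiny hand-written homophone table (`HOMOPHONES` is an assumption, not the authors' lexicon) built around the paper's own "mom's spaghetti" → "Bob's confetti" example.

```python
import re

# Illustrative near-homophone table; a real attack would derive candidates
# from a pronouncing dictionary so rhyme and syllable count are preserved.
HOMOPHONES = {
    "mom's": "Bob's",
    "spaghetti": "confetti",
}

def phonetic_substitute(lyric: str) -> str:
    """Replace words with near-homophones, keeping the acoustic structure
    (rhyme, stress, syllable count) while altering the semantics."""
    def swap(match: re.Match) -> str:
        word = match.group(0)
        return HOMOPHONES.get(word.lower(), word)
    # Match words including apostrophes so contractions swap as a unit.
    return re.sub(r"[\w']+", swap, lyric)

print(phonetic_substitute("mom's spaghetti"))  # -> Bob's confetti
```

The attack then feeds the substituted lyrics to the generation model unchanged; everything else about the prompt stays benign.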
Problem

Research questions and friction points this paper is trying to address.

Explores vulnerability of Lyrics-to-Song models to phonetic memorization attacks
Investigates phonetic-to-visual regurgitation in text-to-video generation models
Examines copyright and safety risks in multimodal generative systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adversarial phonetic prompting for acoustic preservation
Sub-lexical memorization detection in audio models
Phonetic-to-visual regurgitation in video generation
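The audio-domain metrics named above (CLAP, AudioJudge, CoverID) ultimately compare a generated output against a reference recording; one common pattern is cosine similarity between embeddings, flagged against unrelated material. The sketch below illustrates only that comparison pattern, with random vectors standing in for real audio embeddings; it is not the paper's evaluation pipeline.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
original = rng.normal(size=512)                      # reference-song embedding (stand-in)
generated = original + 0.05 * rng.normal(size=512)   # near-duplicate model output
unrelated = rng.normal(size=512)                     # embedding of some other song

# A memorized output sits far closer to the reference than unrelated audio does.
assert cosine_similarity(original, generated) > cosine_similarity(original, unrelated)
```

In practice a fixed similarity threshold, calibrated on known non-memorized pairs, would decide whether a generation counts as regurgitation.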