Sketch2Sound: Controllable Audio Generation via Time-Varying Signals and Sonic Imitations

📅 2024-12-11
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the problem of jointly conditioning high-fidelity audio generation on interpretable time-varying control signals (loudness, brightness, pitch) and text prompts, including sketch-like control from vocal imitations or drawn acoustic contours. Methodologically, the authors propose a lightweight control adapter, requiring only a single linear layer per control and 40k fine-tuning steps, and apply random median filters to the control signals during training so the model can be prompted at flexible levels of temporal specificity. Text, time-varying signals, and sketch conditions are unified within a latent diffusion transformer (DiT) framework. Experiments show the approach substantially improves adherence to reference acoustic contours while preserving text consistency and audio quality relative to a text-only baseline. The method establishes an efficient, artist-centric paradigm for fine-grained sound creation with precise, intuitive control.
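The control adapter described above amounts to a single linear projection per control signal whose output is injected into the DiT's token sequence. A minimal numpy sketch of that conditioning path (all names, dimensions, and the additive-injection choice are illustrative assumptions, not the paper's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_model = 100, 64  # hypothetical control-sequence length and DiT hidden size

# Three frame-aligned control signals, one value per token position.
controls = {name: rng.standard_normal(T)
            for name in ("loudness", "brightness", "pitch")}

# One tiny linear adapter (weight, bias) per control: maps a scalar to d_model.
adapters = {name: (rng.standard_normal((1, d_model)) * 0.02, np.zeros(d_model))
            for name in controls}

def apply_control_adapters(tokens, controls, adapters):
    """Add each linearly projected control signal onto the token sequence."""
    out = tokens.copy()
    for name, signal in controls.items():
        w, b = adapters[name]
        out += signal[:, None] @ w + b  # (T,1) @ (1,d_model) -> (T,d_model)
    return out

tokens = rng.standard_normal((T, d_model))
conditioned = apply_control_adapters(tokens, controls, adapters)
print(conditioned.shape)  # (100, 64)
```

Because each adapter is a single 1-to-d_model linear map, the added parameter count is tiny compared to a ControlNet-style branch, which duplicates large parts of the backbone.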

📝 Abstract
We present Sketch2Sound, a generative audio model capable of creating high-quality sounds from a set of interpretable time-varying control signals: loudness, brightness, and pitch, as well as text prompts. Sketch2Sound can synthesize arbitrary sounds from sonic imitations (i.e., a vocal imitation or a reference sound-shape). Sketch2Sound can be implemented on top of any text-to-audio latent diffusion transformer (DiT), and requires only 40k steps of fine-tuning and a single linear layer per control, making it more lightweight than existing methods like ControlNet. To synthesize from sketch-like sonic imitations, we propose applying random median filters to the control signals during training, allowing Sketch2Sound to be prompted using controls with flexible levels of temporal specificity. We show that Sketch2Sound can synthesize sounds that follow the gist of input controls from a vocal imitation while retaining the adherence to an input text prompt and audio quality compared to a text-only baseline. Sketch2Sound allows sound artists to create sounds with the semantic flexibility of text prompts and the expressivity and precision of a sonic gesture or vocal imitation. Sound examples are available at https://hugofloresgarcia.art/sketch2sound/.
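The three control signals named in the abstract (loudness, brightness, pitch) are all standard frame-level audio features. A self-contained numpy sketch of one plausible way to extract them from a waveform (the frame/hop sizes and the crude argmax pitch estimate are assumptions; the paper's exact feature extractors may differ):

```python
import numpy as np

sr = 16000
t = np.arange(sr) / sr
audio = np.sin(2 * np.pi * 440.0 * t)  # 1-second 440 Hz test tone

frame, hop = 1024, 256  # assumed analysis parameters

def frame_signal(x, frame, hop):
    """Slice a 1-D signal into overlapping frames, shape (n_frames, frame)."""
    n = 1 + (len(x) - frame) // hop
    idx = np.arange(frame)[None, :] + hop * np.arange(n)[:, None]
    return x[idx]

frames = frame_signal(audio, frame, hop)
spec = np.abs(np.fft.rfft(frames * np.hanning(frame), axis=1))
freqs = np.fft.rfftfreq(frame, 1.0 / sr)

# Loudness: per-frame RMS level in dB.
loudness = 20 * np.log10(np.sqrt(np.mean(frames**2, axis=1)) + 1e-8)
# Brightness: spectral centroid in Hz.
brightness = (spec @ freqs) / (spec.sum(axis=1) + 1e-8)
# Pitch: crude f0 from the peak FFT bin (a real system would use a pitch tracker).
f0 = freqs[spec.argmax(axis=1)]
```

On the test tone this yields loudness near -3 dB (a unit sine's RMS is 1/sqrt(2)) and brightness and f0 near 440 Hz; on a real vocal imitation these three curves become the time-varying controls fed to the model.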
Problem

Research questions and friction points this paper is trying to address.

Generating high-fidelity audio conditioned jointly on text prompts and interpretable time-varying controls (loudness, brightness, pitch)
Synthesizing arbitrary sounds from sketch-like sonic imitations, e.g. vocal imitations or reference sound shapes with loose timing
Keeping the control mechanism lightweight, avoiding the parameter and training cost of adapters like ControlNet
Innovation

Methods, ideas, or system contributions that make the work stand out.

Conditions a latent diffusion transformer (DiT) on time-varying loudness, brightness, and pitch signals
Applies random median filters to control signals during training, enabling flexible temporal specificity for sonic imitations
Requires only a single linear layer per control and 40k fine-tuning steps on top of any text-to-audio DiT
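The random median filtering listed above can be sketched in a few lines: during training, each control signal is smoothed with a median filter whose kernel size is drawn at random, so the model learns to follow controls at anything from exact to gist-level timing. A numpy sketch (the kernel-size set is an illustrative assumption, not the paper's values):

```python
import numpy as np

rng = np.random.default_rng(0)

def median_filter(x, k):
    """Median-filter a 1-D signal with an odd kernel size k, edge-padded."""
    pad = k // 2
    xp = np.pad(x, pad, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(xp, k)
    return np.median(windows, axis=-1)

def random_median_filter(signal, kernel_sizes=(1, 5, 11, 21)):
    """Training-time augmentation: smooth with a randomly chosen kernel size."""
    k = int(rng.choice(kernel_sizes))  # hypothetical kernel-size choices
    return median_filter(signal, k) if k > 1 else signal
```

Larger kernels wipe out fine temporal detail and keep only the coarse contour, which is exactly the regime a loose vocal imitation lives in, while a kernel of 1 leaves the signal untouched for precise control.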