TokenSynth: A Token-based Neural Synthesizer for Instrument Cloning and Text-to-Instrument

📅 2025-02-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses three tasks with a single model and no fine-tuning: instrument cloning, text-to-instrument synthesis, and text-guided timbre manipulation. It introduces TokenSynth, a decoder-only Transformer that conditions on MIDI tokens and a CLAP (Contrastive Language-Audio Pretraining) embedding carrying timbre information, and autoregressively generates the tokens of a neural audio codec. Because CLAP embeds audio and text in a shared space, the same model can clone an instrument from reference audio, synthesize an instrument from a text description, or adjust timbre with a text prompt. The authors evaluate audio quality, timbral similarity to the target audio or text, and synthesis accuracy (how closely the output follows the input MIDI) using objective measures. Source code, model weights, and audio demos are publicly available.

📝 Abstract

Recent advancements in neural audio codecs have enabled the use of tokenized audio representations in various audio generation tasks, such as text-to-speech, text-to-audio, and text-to-music generation. Leveraging this approach, we propose TokenSynth, a novel neural synthesizer that utilizes a decoder-only transformer to generate desired audio tokens from MIDI tokens and a CLAP (Contrastive Language-Audio Pretraining) embedding, which carries timbre-related information. Our model is capable of performing instrument cloning, text-to-instrument synthesis, and text-guided timbre manipulation without any fine-tuning. This flexibility enables diverse sound design and intuitive timbre control. We evaluated the quality of the synthesized audio, the timbral similarity between synthesized and target audio/text, and synthesis accuracy (i.e., how accurately it follows the input MIDI) using objective measures. TokenSynth demonstrates the potential of leveraging advanced neural audio codecs and transformers to create powerful and versatile neural synthesizers. The source code, model weights, and audio demos are available at: https://github.com/KyungsuKim42/tokensynth
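
The generation scheme described in the abstract, a decoder-only model that emits audio tokens conditioned on MIDI tokens and a timbre embedding, can be illustrated with a minimal sketch. Everything below is hypothetical and is not taken from the TokenSynth repository: the vocabulary size, the `EOS` id, the greedy decoding rule, and `toy_model`, which is an untrained stand-in for the transformer. It only shows the shape of the autoregressive loop.

```python
from typing import Callable, List, Sequence

# Hypothetical vocabulary layout, chosen for illustration only.
AUDIO_VOCAB = 1024   # assumed number of codec codes
EOS = AUDIO_VOCAB    # assumed end-of-audio token id

# A model maps (timbre embedding, MIDI tokens, audio tokens so far)
# to one logit per candidate next token.
NextTokenFn = Callable[[Sequence[float], Sequence[int], Sequence[int]], List[float]]


def generate_audio_tokens(
    next_token_logits: NextTokenFn,
    timbre_embedding: Sequence[float],
    midi_tokens: Sequence[int],
    max_len: int = 16,
) -> List[int]:
    """Greedy autoregressive decoding: each new audio token depends on the
    conditioning inputs and on all audio tokens generated before it."""
    audio_tokens: List[int] = []
    for _ in range(max_len):
        logits = next_token_logits(timbre_embedding, midi_tokens, audio_tokens)
        token = max(range(len(logits)), key=logits.__getitem__)
        if token == EOS:
            break
        audio_tokens.append(token)
    return audio_tokens


def toy_model(
    timbre_embedding: Sequence[float],
    midi_tokens: Sequence[int],
    audio_tokens: Sequence[int],
) -> List[float]:
    """Untrained stand-in for the transformer (an assumption, not the paper's
    model). It emits one deterministic code per MIDI token, then EOS."""
    logits = [0.0] * (AUDIO_VOCAB + 1)
    if len(audio_tokens) >= len(midi_tokens):
        logits[EOS] = 1.0
    else:
        note = midi_tokens[len(audio_tokens)]
        offset = round(sum(timbre_embedding) * 10)
        logits[(note * 7 + offset) % AUDIO_VOCAB] = 1.0
    return logits


# Example call with a made-up 2-dimensional embedding and a C major triad.
tokens = generate_audio_tokens(toy_model, [0.1, 0.2], [60, 64, 67])
```

In a real system the resulting token sequence would be passed to the codec's decoder to obtain a waveform, and sampling would typically replace the greedy choice used here.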

Problem

Research questions and friction points this paper is trying to address.

Neural synthesizer for audio generation

Instrument cloning and text-to-instrument synthesis

Text-guided timbre manipulation without fine-tuning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Token-based neural synthesizer

Decoder-only transformer audio generation

CLAP embedding for timbre control
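
Since CLAP places audio and text in a shared embedding space, one plausible way to steer timbre with text is to blend a reference-audio embedding with a text embedding before conditioning the model. The page does not state how TokenSynth performs this manipulation, so the normalized linear blend below is an assumption, and `blend_timbre` and `alpha` are hypothetical names.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector has no direction."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def blend_timbre(
    audio_emb: Sequence[float],
    text_emb: Sequence[float],
    alpha: float,
) -> List[float]:
    """Interpolate between two unit-normalized embeddings.

    alpha = 0.0 keeps the reference-audio timbre, alpha = 1.0 follows the
    text description, and values in between mix the two. The result is
    re-normalized to unit length.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if len(audio_emb) != len(text_emb):
        raise ValueError("embeddings must have the same dimension")
    a = l2_normalize(audio_emb)
    t = l2_normalize(text_emb)
    mixed = [(1.0 - alpha) * x + alpha * y for x, y in zip(a, t)]
    return l2_normalize(mixed)


# Example with made-up 2-dimensional embeddings.
conditioning = blend_timbre([1.0, 0.0], [0.0, 1.0], alpha=0.5)
```

The blended vector would then take the place of the single CLAP embedding in the model's conditioning input. Other combination rules, such as adding a text-derived direction to the audio embedding, are equally conceivable.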
