🤖 AI Summary
This work addresses three tasks with a single model and no fine-tuning: instrument cloning, text-to-instrument synthesis, and text-guided timbre manipulation. It introduces TokenSynth, a token-based decoder-only Transformer that conditions on MIDI tokens and a CLAP-derived text or audio embedding and autoregressively generates the tokens of a neural audio codec, which are then decoded to audio. Because timbre is conveyed through the CLAP embedding, the same model supports instrument cloning (from reference audio only), text-driven synthesis (e.g., “violin playing jazz”), and fine-grained timbre editing (e.g., “brighter and softer”) in a zero-shot manner. Objective evaluations cover audio quality, timbral similarity between the synthesized audio and the target audio or text, and synthesis accuracy (how closely the output follows the input MIDI). The source code, model weights, and audio demos are publicly available.
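To make the role of the conditioning embedding concrete, the sketch below shows one way the three tasks could differ only in the vector handed to the model. It is an illustration under stated assumptions, not the authors' implementation: `EMB_DIM`, `conditioning_vector`, and `text_weight` are hypothetical names, random vectors stand in for real CLAP embeddings, and the weighted blend used for timbre manipulation is an assumed scheme that the summary does not describe.

```python
import numpy as np

EMB_DIM = 512  # hypothetical size of the shared text/audio embedding space


def l2_normalize(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)


def conditioning_vector(audio_emb=None, text_emb=None, text_weight=0.5):
    """Pick the timbre-conditioning vector for one of three tasks.

    audio only   -> instrument cloning
    text only    -> text-to-instrument synthesis
    audio + text -> text-guided timbre manipulation (ASSUMED blend,
                    not confirmed by the source)
    """
    if audio_emb is None and text_emb is None:
        raise ValueError("provide an audio embedding, a text embedding, or both")
    if text_emb is None:
        return l2_normalize(audio_emb)
    if audio_emb is None:
        return l2_normalize(text_emb)
    mixed = (1.0 - text_weight) * l2_normalize(audio_emb) \
        + text_weight * l2_normalize(text_emb)
    return l2_normalize(mixed)


# Random stand-ins for embeddings a CLAP encoder would produce.
rng = np.random.default_rng(0)
ref_audio_emb = rng.normal(size=EMB_DIM)  # reference instrument recording
prompt_emb = rng.normal(size=EMB_DIM)     # e.g. "brighter and softer"

clone_cond = conditioning_vector(audio_emb=ref_audio_emb)
text_cond = conditioning_vector(text_emb=prompt_emb)
edit_cond = conditioning_vector(audio_emb=ref_audio_emb,
                                text_emb=prompt_emb,
                                text_weight=0.3)
```

In this sketch the model never needs to know which task it is performing; it only sees a unit-length vector in the shared embedding space, which is consistent with the claim that no task-specific fine-tuning is required.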
📝 Abstract
Recent advances in neural audio codecs have enabled the use of tokenized audio representations in various audio generation tasks, such as text-to-speech, text-to-audio, and text-to-music generation. Leveraging this approach, we propose TokenSynth, a novel neural synthesizer that uses a decoder-only transformer to generate the desired audio tokens from MIDI tokens and a CLAP (Contrastive Language-Audio Pretraining) embedding, which carries timbre-related information. Our model can perform instrument cloning, text-to-instrument synthesis, and text-guided timbre manipulation without any fine-tuning. This flexibility enables diverse sound design and intuitive timbre control. Using objective measures, we evaluated the quality of the synthesized audio, the timbral similarity between the synthesized audio and the target audio or text, and the synthesis accuracy (i.e., how accurately the output follows the input MIDI). TokenSynth demonstrates the potential of leveraging advanced neural audio codecs and transformers to create powerful and versatile neural synthesizers. The source code, model weights, and audio demos are available at: https://github.com/KyungsuKim42/tokensynth
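The generation scheme described in the abstract (a timbre embedding and MIDI tokens as context, audio tokens produced one at a time) can be sketched as a plain sampling loop. This is a minimal sketch under assumptions, not the released code: `toy_next_token_logits` is a hypothetical stand-in for the transformer forward pass, `CODEBOOK_SIZE` and `EOS` are made-up values, and the audio is simplified to a single token stream even though real codecs typically emit several codebooks per frame.

```python
import numpy as np

CODEBOOK_SIZE = 1024  # hypothetical codec vocabulary size
EOS = CODEBOOK_SIZE   # hypothetical end-of-sequence token id


def toy_next_token_logits(clap_emb, midi_tokens, audio_tokens):
    """Stand-in for a decoder-only transformer forward pass.

    Returns pseudo-random logits over CODEBOOK_SIZE + 1 tokens
    (audio codes plus EOS) that depend deterministically on the inputs.
    """
    seed = (int(np.abs(clap_emb).sum() * 1e3)
            + sum(midi_tokens)
            + 31 * len(audio_tokens)) % (2 ** 32)
    return np.random.default_rng(seed).normal(size=CODEBOOK_SIZE + 1)


def generate(clap_emb, midi_tokens, max_len=16, temperature=1.0, seed=0):
    """Sample audio tokens autoregressively, given timbre and MIDI context."""
    rng = np.random.default_rng(seed)
    audio_tokens = []
    for _ in range(max_len):
        logits = toy_next_token_logits(clap_emb, midi_tokens, audio_tokens)
        logits = logits / temperature
        probs = np.exp(logits - logits.max())  # numerically stable softmax
        probs /= probs.sum()
        token = int(rng.choice(len(probs), p=probs))
        if token == EOS:
            break
        audio_tokens.append(token)  # feed the sample back as context
    return audio_tokens


cond = np.ones(512) / np.sqrt(512)  # placeholder for a CLAP embedding
midi = [60, 64, 67]                 # placeholder MIDI token ids
tokens = generate(cond, midi)
```

In a real system the returned token ids would be passed to the codec's decoder to obtain a waveform; that step is omitted here because it depends on the specific codec.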