🤖 AI Summary
Existing music transformation methods focus solely on style or content transfer, neglecting the decisive impact of playback hardware (particularly speaker frequency response characteristics) on perceived audio quality.
Method: We propose the first device-adaptive music re-rendering framework: (1) modeling speaker frequency response curves as learnable embeddings to enable parametric representation of hardware properties; (2) leveraging vision-language models to parse frequency response plots and extract semantic device features; and (3) designing a hybrid Transformer architecture with feature-level linear modulation to jointly support cross-device audio fidelity transfer and stylistic adaptation.
Contribution/Results: After fine-tuning on a self-constructed dataset, our method generalizes well to unseen devices under few-shot settings. It significantly improves audio fidelity preservation and robustness to device variation, moving beyond the conventional paradigm that treats music transformation in isolation from playback hardware constraints.
📝 Abstract
Device-guided music transfer adapts music playback to unseen devices for users who lack access to them. Existing methods mainly modify timbre, rhythm, harmony, or instrumentation to mimic genres or artists, overlooking the diverse hardware properties of the playback device (i.e., the speaker). We therefore propose DeMT, which renders a speaker's frequency response curve as a line graph and processes it with a vision-language model to extract device embeddings. These embeddings then condition a hybrid transformer via feature-wise linear modulation. Fine-tuned on a self-collected dataset, DeMT enables effective speaker-style transfer and robust few-shot adaptation to unseen devices, supporting applications such as device-style augmentation and quality enhancement.
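To make the conditioning mechanism concrete, here is a minimal sketch of feature-wise linear modulation (FiLM): a device embedding predicts per-channel scale and shift parameters that modulate the transformer's intermediate features. All names, shapes, and the linear projections below are illustrative assumptions, not DeMT's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def film_condition(features, device_emb, W_gamma, W_beta):
    """FiLM: scale and shift each feature channel using parameters
    predicted (here via a simple linear map) from the device embedding."""
    gamma = device_emb @ W_gamma  # per-channel scale, shape (channels,)
    beta = device_emb @ W_beta    # per-channel shift, shape (channels,)
    return gamma * features + beta  # broadcasts over the time axis

# Hypothetical sizes: 128-d device embedding, 64 feature channels, 100 frames.
emb_dim, channels, frames = 128, 64, 100
W_gamma = rng.standard_normal((emb_dim, channels)) * 0.01
W_beta = rng.standard_normal((emb_dim, channels)) * 0.01
device_emb = rng.standard_normal(emb_dim)          # e.g., extracted by the VLM
features = rng.standard_normal((frames, channels))  # intermediate audio features

out = film_condition(features, device_emb, W_gamma, W_beta)
print(out.shape)  # (100, 64): same shape, channel-wise modulated
```

In a trained model, `W_gamma` and `W_beta` would be learned jointly with the transformer, so the same backbone can be steered toward different target speakers by swapping the device embedding.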