Device-Guided Music Transfer

📅 2025-11-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing music transformation methods focus solely on style or content transfer, neglecting the decisive impact of playback hardware—particularly speaker frequency response characteristics—on perceived audio quality. Method: We propose the first device-adaptive music re-rendering framework: (1) modeling speaker frequency response curves as learnable embeddings to enable a parametric representation of hardware properties; (2) leveraging vision-language models to parse frequency response plots and extract semantic device features; and (3) designing a hybrid Transformer architecture with feature-wise linear modulation to jointly support cross-device audio fidelity transfer and stylistic adaptation. Contribution/Results: After fine-tuning on a self-constructed dataset, the method generalizes well to unseen devices in few-shot settings. It significantly improves audio fidelity preservation and robustness to device variation, moving beyond the conventional paradigm that treats music transformation in isolation from playback hardware constraints.

📝 Abstract
Device-guided music transfer adapts playback across unseen devices for users who lack them. Existing methods mainly focus on modifying the timbre, rhythm, harmony, or instrumentation to mimic genres or artists, overlooking the diverse hardware properties of the playback device (i.e., speaker). Therefore, we propose DeMT, which processes a speaker's frequency response curve as a line graph using a vision-language model to extract device embeddings. These embeddings then condition a hybrid transformer via feature-wise linear modulation. Fine-tuned on a self-collected dataset, DeMT enables effective speaker-style transfer and robust few-shot adaptation for unseen devices, supporting applications like device-style augmentation and quality enhancement.
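The feature-wise linear modulation (FiLM) conditioning described in the abstract can be illustrated with a minimal sketch: a device embedding is projected to per-channel scale (gamma) and shift (beta) parameters that modulate the transformer's audio features. All names and shapes below are illustrative assumptions, not DeMT's actual architecture.

```python
import numpy as np

def film(features, gamma, beta):
    """Feature-wise linear modulation: scale and shift each feature
    channel using parameters predicted from a device embedding."""
    return gamma * features + beta

rng = np.random.default_rng(0)
# Hypothetical shapes: (time_steps, channels) audio features and a
# device embedding extracted upstream (e.g. by a vision-language model).
features = rng.standard_normal((128, 64))
device_embedding = rng.standard_normal(32)
# A single linear projection stands in for the conditioning network.
W_gamma = rng.standard_normal((32, 64)) * 0.01
W_beta = rng.standard_normal((32, 64)) * 0.01
gamma = 1.0 + device_embedding @ W_gamma  # scale, initialized near identity
beta = device_embedding @ W_beta          # shift
out = film(features, gamma, beta)
print(out.shape)  # (128, 64)
```

Initializing gamma near 1 and beta near 0 keeps the modulation close to an identity map early in training, a common FiLM practice.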
Problem

Research questions and friction points this paper is trying to address.

Adapting music playback for unseen devices that users cannot access
Extracting device embeddings from frequency response curves with vision-language models
Enabling speaker-style transfer and few-shot adaptation across diverse hardware
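To make the notion of "device style" concrete, the sketch below applies a speaker-like frequency response to audio by scaling FFT bins. This is a simplified, phase-free equalization stand-in for the learned re-rendering in the paper; the response curve here is a made-up small-speaker example, not data from the paper's dataset.

```python
import numpy as np

def apply_frequency_response(audio, response_db, sr):
    """Color audio with a frequency response given as (freq_hz, gain_db)
    control points, by interpolating a gain for each FFT bin."""
    spectrum = np.fft.rfft(audio)
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / sr)
    gain_db = np.interp(freqs, response_db[:, 0], response_db[:, 1])
    gain = 10.0 ** (gain_db / 20.0)  # dB to linear amplitude
    return np.fft.irfft(spectrum * gain, n=len(audio))

sr = 16000
t = np.arange(sr) / sr
audio = np.sin(2 * np.pi * 440 * t)  # one second of a 440 Hz tone
# Hypothetical small-speaker curve: strong bass roll-off, slight treble lift.
curve = np.array([[20.0, -24.0], [200.0, -6.0], [1000.0, 0.0], [8000.0, 3.0]])
colored = apply_frequency_response(audio, curve, sr)
print(colored.shape)  # (16000,)
```

A 440 Hz tone falls in the attenuated low-mid region of this curve, so the output amplitude is noticeably reduced, which is the kind of device-specific coloration the framework aims to model and transfer.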
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses a vision-language model to extract device embeddings from frequency response plots
Conditions a hybrid transformer via feature-wise linear modulation
Supports few-shot adaptation to unseen devices