🤖 AI Summary
Existing controllable music generation methods often require model retraining or introduce audible artifacts, compromising either control precision or audio fidelity. This paper proposes MusicRFM, the first framework to integrate Recursive Feature Machines (RFMs) into music generation control. It trains lightweight probes on the hidden-state gradients of a pretrained MusicGen model to identify interpretable, semantically meaningful directions in the activation space, such as those corresponding to notes or chords. During inference, these directional control signals are injected in real time, without fine-tuning or per-step iterative optimization. The method supports dynamic, time-varying steering schedules and joint constraints over multiple musical attributes, improving control granularity while preserving adherence to the text prompt. Experiments show that target-note accuracy improves substantially, from 0.23 to 0.82, while text adherence drops by only about 0.02 relative to the uncontrolled baseline, striking a favorable balance between controllability and audio fidelity.
📝 Abstract
Controllable music generation remains a significant challenge, with existing methods often requiring model retraining or introducing audible artifacts. We introduce MusicRFM, a framework that adapts Recursive Feature Machines (RFMs) to enable fine-grained, interpretable control over frozen, pre-trained music models by directly steering their internal activations. RFMs analyze a model's internal gradients to produce interpretable "concept directions": specific axes in the activation space that correspond to musical attributes like notes or chords. We first train lightweight RFM probes to discover these directions within MusicGen's hidden states; then, during inference, we inject them back into the model to guide the generation process in real time without per-step optimization. We present advanced mechanisms for this control, including dynamic, time-varying schedules and methods for the simultaneous enforcement of multiple musical properties. Our method successfully navigates the trade-off between control and generation quality: we can increase the accuracy of generating a target musical note from 0.23 to 0.82, while text prompt adherence remains within approximately 0.02 of the unsteered baseline, demonstrating effective control with minimal impact on prompt fidelity. We release code to encourage further exploration of RFMs in the music domain.
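The core steering step described in the abstract, adding learned concept directions to a frozen model's hidden states under a time-varying schedule, can be sketched as below. This is a minimal illustrative sketch, not the paper's implementation: the function names, the plain NumPy vectors standing in for MusicGen's transformer activations, and the linear decay schedule are all assumptions for exposition.

```python
import numpy as np

def steer_hidden_state(h, directions, alphas):
    # Add each unit-normalized concept direction, scaled by its steering
    # strength, to the hidden-state vector; the model weights stay frozen.
    h_steered = np.asarray(h, dtype=float).copy()
    for v, alpha in zip(directions, alphas):
        v = np.asarray(v, dtype=float)
        h_steered += alpha * v / np.linalg.norm(v)
    return h_steered

def linear_decay_schedule(step, total_steps, alpha_max):
    # One possible time-varying schedule (illustrative): full strength at
    # the first generation step, fading linearly to zero by the last.
    return alpha_max * (1.0 - step / max(total_steps - 1, 1))

# Toy demo: jointly enforce two attributes by steering a 4-dim hidden
# state with a hypothetical "note" direction and a weaker "chord" direction.
h = np.zeros(4)
note_dir = np.array([1.0, 0.0, 0.0, 0.0])
chord_dir = np.array([0.0, 2.0, 0.0, 0.0])   # unnormalized; norm is 2
alpha = linear_decay_schedule(step=0, total_steps=10, alpha_max=3.0)
h_new = steer_hidden_state(h, [note_dir, chord_dir], [alpha, 1.0])
# h_new is displaced by 3.0 along the note axis and 1.0 along the chord axis
```

In a real setup the steering would be applied inside the forward pass of each chosen transformer layer (e.g. via a hook), with the directions coming from the trained RFM probes rather than hand-picked axes.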