SMITIN: Self-Monitored Inference-Time INtervention for Generative Music Transformers

📅 2024-04-02
🏛️ IEEE Open Journal of Signal Processing
📈 Citations: 2
Influential: 0
🤖 AI Summary
To achieve fine-grained controllability without sacrificing temporal coherence in music generation, this paper proposes SMITIN, a lightweight inference-time intervention framework that requires no retraining or fine-tuning of the generative model. SMITIN trains classifier probes (simple logistic regression models) on the outputs of individual attention heads in an autoregressive music Transformer, using a small dataset of audio examples that either exhibit or lack a target trait, such as the presence of drums or real versus synthetic music. During generation it steers the attention heads in the probe direction and monitors the probe output, adjusting the intervention so that it is strong enough to impose the trait without breaking sequence coherence. The result is a probe-driven, plug-and-play intervention applied at the attention layers. Evaluated on audio continuation and text-to-music tasks, it significantly improves control accuracy (e.g., +37% Drum Recall) and achieves a subjective Mean Opinion Score (MOS) of 4.1+, demonstrating both computational efficiency and strong cross-task generalizability.

📝 Abstract
We introduce Self-Monitored Inference-Time INtervention (SMITIN), an approach for controlling an autoregressive generative music transformer using classifier probes. These simple logistic regression probes are trained on the output of each attention head in the transformer using a small dataset of audio examples both exhibiting and missing a specific musical trait (e.g., the presence/absence of drums, or real/synthetic music). We then steer the attention heads in the probe direction, ensuring the generative model output captures the desired musical trait. Additionally, we monitor the probe output to avoid adding an excessive amount of intervention into the autoregressive generation, which could lead to temporally incoherent music. We validate our results objectively and subjectively for both audio continuation and text-to-music applications, demonstrating the ability to add controls to large generative models for which retraining or even fine-tuning is impractical for most musicians. Audio samples of the proposed intervention approach are available on our demo page http://tinyurl.com/smitin .
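
The probe-and-steer loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the synthetic "head outputs", the function names (`train_probe`, `probe_prob`, `intervene`), the fixed step size `alpha`, and the `target` probability used for monitoring are all assumptions made for illustration. It only shows the general idea of fitting a logistic regression probe on per-head activations, nudging an activation along the probe direction, and skipping the nudge once the probe already detects the trait.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_probe(X, y, lr=0.5, steps=500):
    """Fit a logistic regression probe (weights w, bias b) by gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = sigmoid(X @ w + b)
        grad = p - y                      # gradient of the log loss w.r.t. the logit
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def probe_prob(h, w, b):
    """Probability the probe assigns to 'trait present' for head output h."""
    return sigmoid(h @ w + b)

def intervene(h, w, b, alpha=1.0, target=0.9):
    """Shift h along the probe direction, but only while the probe
    probability is below `target` (hypothetical monitoring rule)."""
    if probe_prob(h, w, b) >= target:
        return h                          # trait already detected: leave h alone
    direction = w / np.linalg.norm(w)
    return h + alpha * direction

# Synthetic stand-in for one attention head's outputs (assumption: 16 dims).
rng = np.random.default_rng(0)
dim = 16
trait_axis = rng.normal(size=dim)
trait_axis /= np.linalg.norm(trait_axis)
pos = rng.normal(size=(200, dim)) + 2.0 * trait_axis   # examples with the trait
neg = rng.normal(size=(200, dim)) - 2.0 * trait_axis   # examples without it
X = np.vstack([pos, neg])
y = np.concatenate([np.ones(200), np.zeros(200)])

w, b = train_probe(X, y)

h = -2.0 * trait_axis                     # a head output lacking the trait
steered = intervene(h, w, b)
print("probe before:", probe_prob(h, w, b))
print("probe after: ", probe_prob(steered, w, b))
```

In a real autoregressive model the check would run at every generation step and for each monitored head, so the intervention naturally fades once the generated context carries the trait. How the paper weights heads or scales the step is not specified in this summary, so the constant `alpha` above is only a placeholder.
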
Problem

Research questions and friction points this paper is trying to address.

Automatic Music Generation
Real-time Adjustment
Quality Control
Innovation

Methods, ideas, or system contributions that make the work stand out.

SMITIN
Real-time Parameter Adjustment
Targeted Training