🤖 AI Summary
Existing audio generation methods often rely on model retraining or computationally expensive inference-time guidance to achieve fine-grained control, struggling to balance precision and efficiency. This work proposes a low-resource latent-space guidance approach that enables precise, multi-dimensional control over attributes such as intensity, pitch, and tempo directly within the latent variable space of a diffusion model. By integrating a selective Targeted Feature Guidance (TFG) strategy with lightweight Latent Control Heads (LatCHs), the method achieves high-quality audio synthesis with only 7M additional parameters and approximately four hours of training on the Stable Audio Open model. The approach significantly reduces computational overhead while outperforming conventional end-to-end guidance techniques in both controllability and generation quality.
📝 Abstract
Generative audio applications demand fine-grained control over outputs, yet most existing methods require either retraining the model for specific controls or applying inference-time controls (e.g., guidance) that can themselves be computationally demanding. By examining the bottlenecks of existing guidance-based controls, in particular their high cost per step due to backpropagation through the decoder, we introduce a guidance-based approach built on selective Targeted Feature Guidance (TFG) and Latent-Control Heads (LatCHs), which enables control of latent audio diffusion models with low computational overhead. LatCHs operate directly in latent space, avoiding the expensive decoder step and requiring minimal training resources (7M parameters and ≈4 hours of training). Experiments with Stable Audio Open demonstrate effective control over intensity, pitch, and beats (and combinations of these) while maintaining generation quality. Our method balances precision and audio fidelity at far lower computational cost than standard end-to-end guidance. Demo examples can be found at https://zacharynovack.github.io/latch/latch.html.
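The core idea — steering generation with gradients computed entirely in latent space through a lightweight control head, rather than backpropagating through the decoder — can be illustrated with a toy sketch. Everything below (the linear head, the quadratic control loss, the update loop) is an illustrative assumption for intuition, not the paper's actual implementation:

```python
import numpy as np

def latent_guidance_step(z, target_feat, head_weight, scale=0.2):
    """One guided update sketch: a lightweight linear 'control head'
    maps the latent z to a feature estimate, and we nudge z toward
    the target feature using the analytic gradient of an L2 loss.
    Note: the gradient never touches a decoder -- all computation
    stays in latent space."""
    feat = head_weight @ z                       # head's feature prediction
    grad = head_weight.T @ (feat - target_feat)  # d/dz of 0.5*||feat - target||^2
    return z - scale * grad                      # guided latent update

rng = np.random.default_rng(0)
z = rng.normal(size=8)                 # toy latent
W = rng.normal(size=(2, 8)) * 0.5      # toy linear control head
target = np.array([1.0, -1.0])         # desired control feature

for _ in range(500):
    z = latent_guidance_step(z, target, W)

print(np.round(W @ z, 3))  # head output approaches the target feature
```

In the full method, the control head would be a small trained network predicting an attribute (intensity, pitch, or beats) from the diffusion latent, and the guided update would be interleaved with denoising steps; the point of the sketch is only that the guidance gradient is cheap because it bypasses the decoder.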