TimberAgent: Gram-Guided Retrieval for Executable Music Effect Control

📅 2026-03-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the semantic gap between users' natural-language descriptions of desired audio effects and the underlying signal-processing parameters in digital audio workstations. To bridge it, the authors propose Texture Resonance Retrieval (TRR), a framework for editable audio effect control that leverages intermediate-layer activations from Wav2Vec2 to construct Gram matrices capturing co-activation texture structure. This enables precise mapping from natural-language queries to editable effect presets. Notably, this work is the first to introduce Gram matrix–guided, texture-aware representations into audio effect retrieval, prioritizing preset editability over mere waveform generation. A leakage-proof evaluation protocol is also designed to ensure methodological rigor. Evaluated on a benchmark of 1,063 guitar effect presets, TRR achieves the lowest normalized parameter error and demonstrates perceptual efficacy in a listening study with 26 participants.
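The Gram-matrix texture descriptor described above can be sketched as follows. This is a minimal illustration of the idea, not the paper's exact recipe: the activation shapes, the random projection, and cosine-similarity retrieval are all illustrative assumptions.

```python
import numpy as np

def gram_descriptor(activations: np.ndarray, proj: np.ndarray) -> np.ndarray:
    """Build a texture descriptor from frame-level activations.

    activations: (T, D) mid-layer features, e.g. a Wav2Vec2 hidden layer.
    proj: (D, d) projection matrix reducing channel dimensionality.
    Returns the flattened, L2-normalized upper triangle of the Gram matrix,
    which summarizes channel co-activation (texture) structure.
    """
    z = activations @ proj               # (T, d) projected features
    gram = z.T @ z / z.shape[0]          # (d, d) co-activation matrix
    iu = np.triu_indices(gram.shape[0])  # Gram matrix is symmetric
    vec = gram[iu]
    return vec / (np.linalg.norm(vec) + 1e-12)

def retrieve(query_vec: np.ndarray, preset_vecs: np.ndarray) -> int:
    """Index of the preset whose descriptor is most cosine-similar
    to the query (descriptors are unit-norm, so dot product suffices)."""
    return int(np.argmax(preset_vecs @ query_vec))

# Toy usage: 5 candidate presets, 64-dim activations projected to 12 dims.
rng = np.random.default_rng(0)
proj = rng.standard_normal((64, 12)) / np.sqrt(64)
presets = np.stack([gram_descriptor(rng.standard_normal((100, 64)), proj)
                    for _ in range(5)])
best = retrieve(presets[3], presets)  # self-retrieval should return 3
```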

📝 Abstract
Digital audio workstations expose rich effect chains, yet a semantic gap remains between perceptual user intent and low-level signal-processing parameters. We study retrieval-grounded audio effect control, where the output is an editable plugin configuration rather than a finalized waveform. Our focus is Texture Resonance Retrieval (TRR), an audio representation built from Gram matrices of projected mid-level Wav2Vec2 activations. This design preserves texture-relevant co-activation structure. We evaluate TRR on a guitar-effects benchmark with 1,063 candidate presets and 204 queries. The evaluation follows Protocol-A, a cross-validation scheme that prevents train-test leakage. We compare TRR against CLAP and internal retrieval baselines (Wav2Vec-RAG, Text-RAG, FeatureNN-RAG), using min-max normalized metrics grounded in physical DSP parameter ranges. Ablation studies validate TRR's core design choices: projection dimensionality, layer selection, and projection type. A near-duplicate sensitivity analysis confirms that results are robust to trivial knowledge-base matches. TRR achieves the lowest normalized parameter error among evaluated methods. A multiple-stimulus listening study with 26 participants provides complementary perceptual evidence. We interpret these results as benchmark evidence that texture-aware retrieval is useful for editable audio effect control, while broader personalization and real-audio robustness claims remain outside the verified evidence presented here.
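The "min-max normalized metrics grounded in physical DSP parameter ranges" mentioned in the abstract can be sketched like this. The parameter names and ranges below are hypothetical examples for illustration, not the benchmark's actual preset schema.

```python
import numpy as np

# Hypothetical physical ranges for a few DSP parameters (illustrative only).
PARAM_RANGES = {
    "gain_db":   (-24.0, 24.0),
    "cutoff_hz": (20.0, 20000.0),
    "mix":       (0.0, 1.0),
}

def normalized_param_error(pred: dict, target: dict) -> float:
    """Mean absolute error after min-max normalizing each parameter by its
    physical range, so parameters with heterogeneous units (dB, Hz, ratio)
    contribute comparably to a single score."""
    errs = []
    for name, (lo, hi) in PARAM_RANGES.items():
        p = (pred[name] - lo) / (hi - lo)
        t = (target[name] - lo) / (hi - lo)
        errs.append(abs(p - t))
    return float(np.mean(errs))

pred   = {"gain_db": 6.0, "cutoff_hz": 1000.0, "mix": 0.5}
target = {"gain_db": 0.0, "cutoff_hz": 1000.0, "mix": 0.75}
err = normalized_param_error(pred, target)  # mean of 0.125, 0.0, 0.25
```

Normalizing by the physical range (rather than, say, dataset statistics) keeps the metric interpretable as a fraction of each knob's full travel.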
Problem

Research questions and friction points this paper is trying to address.

audio effect control
semantic gap
retrieval
executable configuration
perceptual intent
Innovation

Methods, ideas, or system contributions that make the work stand out.

Texture Resonance Retrieval
Gram matrix
audio effect control
Wav2Vec2
retrieval-based audio editing
Shihao He
College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China
Yihan Xia
College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China
Fang Liu
Computer Science and Engineering, Nanjing University of Science and Technology
Deep Learning, Image Processing, Remote Sensing, SAR, PolSAR
Taotao Wang
Shenzhen University
Blockchain and Blockchain Networks, Wireless Communications and Networking
Shengli Zhang
College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China