🤖 AI Summary
This work proposes Lexical Acoustic Coding (LAC), a framework that uses natural language as both the semantic description of an audio signal and the medium that transmits it. LAC pairs two pre-trained large language model agents: a sender that analyzes an audio waveform into interpretable acoustic descriptors, quantizes them, and writes the result as English text, and a receiver that parses this text back into acoustic constraints and reconstructs the waveform via closed-loop refinement. The agents run under fixed system prompts with no additional training, and the framing as a finite-rate lossy quantizer exposes the trade-offs among vocabulary size, bitrate, and reconstruction fidelity. Experiments on short sounds and symbolic music transfer indicate that plain text can preserve measurable acoustic structure while remaining interpretable, editable, and directly compatible with LLM-mediated communication.
📝 Abstract
Natural language is widely used to describe, prompt, and control audio systems, but rarely serves as the representation carrying audio itself. We introduce lexical acoustic coding (LAC), a framework in which pre-trained LLM sender and receiver agents transmit sound through natural language. Under fixed system prompts, the agents write their own analysis and synthesis code, communicating only through a lexical sentence, a shared vocabulary, and optional symbolic music structure. The sender analyzes an input waveform into interpretable, non-learned acoustic descriptors, quantizes each with a feature-specific interval vocabulary, and verbalizes the lexical code as English. The receiver parses the sentence back into lexical-acoustic constraints and renders a waveform through closed-loop refinement. The transmitted text serves both as a rich caption and as the transport representation itself. We frame LAC as a finite-rate lossy quantizer, exposing trade-offs between vocabulary size, rate, and fidelity. Experiments on short sounds and symbolic music transfer show that plain text preserves measurable acoustic structure while remaining interpretable, editable, and native to LLM-mediated communication.
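To make the quantize, verbalize, and parse steps concrete, here is a minimal Python sketch of an interval-vocabulary round trip. It is an illustration under stated assumptions, not the paper's implementation: the feature names, bin edges, words, sentence template, and all function names are hypothetical, and the closed-loop waveform rendering is omitted. The rate function only counts the bits needed to index one word per feature, which shows how vocabulary size drives rate.

```python
import math
import re
from bisect import bisect_right

# Hypothetical shared vocabulary: per feature, sorted interior bin edges
# and one word per interval (len(words) == len(edges) + 1).
VOCAB = {
    "pitch": {"edges": [150.0, 400.0, 1000.0],
              "words": ["low", "mid", "high", "piercing"]},
    "loudness": {"edges": [-40.0, -20.0],
                 "words": ["quiet", "moderate", "loud"]},
}

def quantize(feature, value):
    """Map a descriptor value to the word naming the interval it falls in."""
    spec = VOCAB[feature]
    return spec["words"][bisect_right(spec["edges"], value)]

def dequantize(feature, word):
    """Map a word back to its (lo, hi) interval; outer intervals are unbounded."""
    spec = VOCAB[feature]
    edges = spec["edges"]
    i = spec["words"].index(word)
    lo = edges[i - 1] if i > 0 else -math.inf
    hi = edges[i] if i < len(edges) else math.inf
    return lo, hi

def verbalize(descriptors):
    """Sender side: turn {feature: value} into one English sentence."""
    body = " and ".join(
        f"the {name} is {quantize(name, value)}"
        for name, value in descriptors.items()
    )
    return body[0].upper() + body[1:] + "."

def parse(sentence):
    """Receiver side: recover {feature: (lo, hi)} constraints from the sentence."""
    pairs = re.findall(r"the (\w+) is (\w+)", sentence, flags=re.IGNORECASE)
    return {name: dequantize(name, word) for name, word in pairs}

def rate_bits():
    """Bits needed to index one word per feature in the vocabulary."""
    return sum(math.log2(len(spec["words"])) for spec in VOCAB.values())

sentence = verbalize({"pitch": 220.0, "loudness": -12.0})
constraints = parse(sentence)
print(sentence)
print(constraints)
print(round(rate_bits(), 3))
```

The receiver recovers only the interval each value fell in, not the value itself, which is the lossy part of the code. Adding words to a feature's vocabulary narrows its intervals at the cost of more bits per sentence.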