🤖 AI Summary
Current text-to-audio models lack quantitative, interpretable metrics for assessing output diversity and fidelity under fixed prompts, which hinders their trustworthy deployment. To address this, we adapt Expressive Range Analysis (ERA), a technique from procedural content generation, into a systematic framework for evaluating text-to-audio generation models. ERA employs a standardized prompt set derived from ESC-50 and extracts acoustic features, including pitch, loudness, and MFCCs, to build a multidimensional, quantitative picture of model outputs. By holding prompts fixed, it characterizes the output distribution and identifies expressive boundaries in a reproducible way. Experiments demonstrate that ERA effectively discriminates between models in terms of diversity and fidelity. By grounding evaluation in perceptually meaningful acoustic attributes, ERA offers an explainable, comparable, and scalable assessment paradigm for generative audio models.
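As a rough illustration of the kind of feature extraction described above (this is a minimal sketch using librosa, not the paper's actual pipeline; the sampling rate, pitch tracker, and MFCC count are assumptions):

```python
# Illustrative sketch: extract pitch, a loudness proxy, and MFCC-based timbre
# features from one generated clip. Not the paper's implementation.
import numpy as np
import librosa

def extract_features(path, sr=22050):
    y, sr = librosa.load(path, sr=sr, mono=True)
    # Pitch: fundamental frequency via the pYIN tracker; NaNs mark unvoiced frames.
    f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                            fmax=librosa.note_to_hz("C7"), sr=sr)
    pitch_hz = float(np.nanmean(f0)) if np.any(~np.isnan(f0)) else 0.0
    # Loudness proxy: mean frame-level RMS energy in dB.
    rms = librosa.feature.rms(y=y)
    loudness_db = float(np.mean(librosa.amplitude_to_db(rms, ref=1.0)))
    # Timbre: mean of 13 MFCCs across frames.
    mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13), axis=1)
    return np.concatenate([[pitch_hz, loudness_db], mfcc])
```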
📝 Abstract
Text-to-audio models are a type of generative model that produces audio output in response to a given textual prompt. Although level generators and the properties of the functional content that they create (e.g., playability) dominate most discourse in procedural content generation (PCG), games that emotionally resonate with players tend to weave together a range of creative and multimodal content (e.g., music, sounds, visuals, narrative tone), and multimodal models have begun seeing at least experimental use for this purpose. However, it remains unclear what exactly such models generate, and with what degree of variability and fidelity: audio is an extremely broad class of output for a generative system to target.
Within the PCG community, expressive range analysis (ERA) has been used as a quantitative way to characterize generators' output space, especially for level generators. This paper adapts ERA to text-to-audio models, making the analysis tractable by looking at the expressive range of outputs for specific, fixed prompts. Experiments are conducted by prompting the models with several standardized prompts derived from the Environmental Sound Classification (ESC-50) dataset. The resulting audio is analyzed along key acoustic dimensions (e.g., pitch, loudness, and timbre). More broadly, this paper offers a framework for ERA-based exploratory evaluation of generative audio models.
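A minimal sketch of how such a fixed-prompt analysis could be assembled is below. The generator call, feature extractor, and summary statistics are placeholders for illustration, not the paper's pipeline:

```python
# Hypothetical driver: for each fixed prompt, generate several clips, extract
# acoustic features (e.g., with an extract_features function like the sketch
# above), and summarize per-dimension spread as a simple expressive-range measure.
import numpy as np

def expressive_range(prompt, generate_clip, extract_features, n_samples=50):
    """generate_clip(prompt) -> path to an audio file; both callables are
    placeholders for whatever model and feature extractor are under study."""
    feats = np.stack([extract_features(generate_clip(prompt))
                      for _ in range(n_samples)])   # shape: (n_samples, n_dims)
    return {
        "mean": feats.mean(axis=0),    # central tendency per acoustic dimension
        "std": feats.std(axis=0),      # spread: wider std suggests a broader expressive range
        "range": feats.max(axis=0) - feats.min(axis=0),
    }
```

Under this sketch, comparing the per-dimension spread across models for the same prompt gives one concrete, reproducible way to contrast their diversity along pitch, loudness, and timbre.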