🤖 AI Summary
Current text-to-audio models lack quantitative, interpretable metrics for assessing output diversity and fidelity under fixed prompts, which hinders their trustworthy deployment. To address this, we adapt Expressive Range Analysis (ERA), a technique from procedural content generation, into a systematic framework for evaluating text-to-audio generation models. ERA employs a standardized prompt set derived from ESC-50 and extracts acoustic features, including pitch, loudness, and MFCCs, to build a multidimensional, quantitative picture of model outputs. By holding prompts fixed, it characterizes the output distribution and identifies expressive boundaries in a reproducible way. Experiments demonstrate that ERA effectively discriminates between models in terms of diversity and fidelity. By grounding evaluation in perceptually meaningful acoustic attributes, ERA offers an explainable, comparable, and scalable assessment paradigm for generative audio models.
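As a rough illustration of the kind of feature extraction described above (this is a minimal sketch using librosa, not the paper's actual pipeline; the sampling rate, pitch tracker, and MFCC count are assumptions):

```python
# Illustrative sketch: extract pitch, a loudness proxy, and MFCC-based timbre
# features from one generated clip. Not the paper's implementation.
import numpy as np
import librosa

def extract_features(path, sr=22050):
    y, sr = librosa.load(path, sr=sr, mono=True)
    # Pitch: fundamental frequency via the pYIN tracker; NaNs mark unvoiced frames.
    f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                            fmax=librosa.note_to_hz("C7"), sr=sr)
    pitch_hz = float(np.nanmean(f0)) if np.any(~np.isnan(f0)) else 0.0
    # Loudness proxy: mean frame-level RMS energy in dB.
    rms = librosa.feature.rms(y=y)
    loudness_db = float(np.mean(librosa.amplitude_to_db(rms, ref=1.0)))
    # Timbre: mean of 13 MFCCs across frames.
    mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13), axis=1)
    return np.concatenate([[pitch_hz, loudness_db], mfcc])
```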
📝 Abstract
Text-to-audio models are a type of generative model that produces audio output in response to a given textual prompt. Although level generators and the properties of the functional content that they create (e.g., playability) dominate most discourse in procedural content generation (PCG), games that emotionally resonate with players tend to weave together a range of creative and multimodal content (e.g., music, sounds, visuals, narrative tone), and multimodal models have begun seeing at least experimental use for this purpose. However, it remains unclear what exactly such models generate, and with what degree of variability and fidelity: audio is an extremely broad class of output for a generative system to target.
Within the PCG community, expressive range analysis (ERA) has been used as a quantitative way to characterize generators' output space, especially for level generators. This paper adapts ERA to text-to-audio models, making the analysis tractable by looking at the expressive range of outputs for specific, fixed prompts. Experiments are conducted by prompting the models with several standardized prompts derived from the Environmental Sound Classification (ESC-50) dataset. The resulting audio is analyzed along key acoustic dimensions (e.g., pitch, loudness, and timbre). More broadly, this paper offers a framework for ERA-based exploratory evaluation of generative audio models.
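A minimal sketch of how such a fixed-prompt analysis could be assembled is below. The generator call, feature extractor, and summary statistics are placeholders for illustration, not the paper's pipeline:

```python
# Hypothetical driver: for each fixed prompt, generate several clips, extract
# acoustic features (e.g., with an extract_features function like the sketch
# above), and summarize per-dimension spread as a simple expressive-range measure.
import numpy as np

def expressive_range(prompt, generate_clip, extract_features, n_samples=50):
    """generate_clip(prompt) -> path to an audio file; both callables are
    placeholders for whatever model and feature extractor are under study."""
    feats = np.stack([extract_features(generate_clip(prompt))
                      for _ in range(n_samples)])   # shape: (n_samples, n_dims)
    return {
        "mean": feats.mean(axis=0),    # central tendency per acoustic dimension
        "std": feats.std(axis=0),      # spread: wider std suggests a broader expressive range
        "range": feats.max(axis=0) - feats.min(axis=0),
    }
```

Under this sketch, comparing the per-dimension spread across models for the same prompt gives one concrete, reproducible way to contrast their diversity along pitch, loudness, and timbre.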