A Survey of Automatic Evaluation Methods on Text, Visual and Speech Generations

📅 2025-06-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing automated evaluation methods for generative content lack a systematic, cross-modal framework. Method: This paper conducts a large-scale literature review and cross-modal comparative analysis to establish, for the first time, a unified evaluation taxonomy covering text, image, and speech modalities. It identifies five fundamental evaluation paradigms and empirically validates their consistent applicability across three representative generative tasks. Furthermore, it introduces a comparability analysis framework to construct a structured knowledge graph that clarifies capability boundaries and limitations of existing methods per modality. Contributions/Results: (1) The first cross-modal unified classification system for generative evaluation; (2) abstraction of generalizable, transferable evaluation paradigms; and (3) a theoretical foundation and practical methodology for cross-modal consistent evaluation and joint metric design. This work bridges critical gaps in evaluating multimodal generative models and enables principled, interoperable assessment across modalities.

📝 Abstract
Recent advances in deep learning have significantly enhanced generative AI capabilities across text, images, and audio. However, automatically evaluating the quality of these generated outputs presents ongoing challenges. Although numerous automatic evaluation methods exist, current research lacks a systematic framework that comprehensively organizes these methods across the text, visual, and audio modalities. To address this issue, we present a comprehensive review and a unified taxonomy of automatic evaluation methods for generated content across all three modalities. We identify five fundamental paradigms that characterize existing evaluation approaches across these domains. Our analysis begins by examining evaluation methods for text generation, where techniques are most mature. We then extend this framework to image and audio generation, demonstrating its broad applicability. Finally, we discuss promising directions for future research in cross-modal evaluation methodologies.
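The abstract notes that automatic evaluation techniques are most mature for text generation. As a concrete illustration of one long-standing family of such methods, reference-based n-gram overlap (the idea behind BLEU-style metrics), here is a minimal sketch. The function name and details are illustrative assumptions, not taken from the paper:

```python
from collections import Counter

def ngram_precision(candidate: str, reference: str, n: int = 2) -> float:
    """Modified n-gram precision, the core of BLEU-style reference-based
    evaluation: the fraction of candidate n-grams that also appear in the
    reference, with counts clipped by the reference's own counts."""
    cand_tokens = candidate.split()
    ref_tokens = reference.split()
    cand_ngrams = Counter(tuple(cand_tokens[i:i + n])
                          for i in range(len(cand_tokens) - n + 1))
    ref_ngrams = Counter(tuple(ref_tokens[i:i + n])
                         for i in range(len(ref_tokens) - n + 1))
    if not cand_ngrams:
        return 0.0
    # Clip each candidate n-gram count by its count in the reference.
    overlap = sum(min(count, ref_ngrams[ng])
                  for ng, count in cand_ngrams.items())
    return overlap / sum(cand_ngrams.values())
```

For example, `ngram_precision("the cat sat", "the dog sat", n=1)` gives 2/3 (two of three unigrams match), while the bigram precision for the same pair is 0.0. Surveys such as this one typically contrast this string-overlap paradigm with embedding-based, learned, and model-as-judge approaches.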
Problem

Research questions and friction points this paper is trying to address.

Lack of systematic framework for evaluating text, visual, and audio outputs
Need for unified taxonomy of automatic evaluation methods across modalities
Identifying fundamental paradigms for cross-modal evaluation approaches
Innovation

Methods, ideas, or system contributions that make the work stand out.

Comprehensive review of evaluation methods
Unified taxonomy for text, visual, audio
Five fundamental paradigms identified
Tian Lan
Beijing Institute of Technology, China
Yang-Hao Zhou
Beijing Institute of Technology, China
Zi-Ao Ma
Beijing Institute of Technology, China
Fanshu Sun
Beijing Institute of Technology, China
Rui-Qing Sun
Beijing Institute of Technology, China
Junyu Luo
Peking University, China
Rong-Cheng Tu
Nanyang Technological University
Image and Video Retrieval, Cross-modal Retrieval, Deep Learning
Heyan Huang
Beijing Institute of Technology, China
Chen Xu
Beijing Institute of Technology, China
Zhijing Wu
Beijing Institute of Technology
Information Retrieval, Natural Language Processing
Xian-Ling Mao
Beijing Institute of Technology
Web Data Mining, Information Extraction, QA & Dialogue, Topic Modeling, Learn to Hashing