🤖 AI Summary
Problem: ChatGPT and GPT-4 currently lack a systematic, multidimensional evaluation framework spanning linguistic understanding, logical reasoning, scientific knowledge, and ethical alignment.

Method: Through a systematic literature review and meta-evaluation of over 200 empirical studies, this work constructs the first comprehensive, cross-task and cross-disciplinary capability map for large language models (LLMs). It proposes a tripartite methodological principle for benchmarking (reproducibility, fairness, and multidimensionality) and establishes a taxonomy for LLM evaluation.

Contribution/Results: The study identifies critical bottlenecks in mathematical reasoning, domain-specific question answering, and value alignment; exposes coverage biases and measurement limitations in existing benchmarks; and provides both theoretical foundations and practical guidelines for designing next-generation LLM evaluation frameworks.
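To make the tripartite principle concrete, below is a minimal, hypothetical sketch of what an evaluation harness built around it might look like: a fixed seed for reproducibility, identical shuffled items for every model for fairness, and per-dimension scores rather than a single aggregate for multidimensionality. All names (`SUITES`, `EvalItem`, `dummy_model`, `exact_match`) and the toy items are illustrative assumptions, not benchmarks or code from the survey.

```python
import random
from dataclasses import dataclass

@dataclass
class EvalItem:
    prompt: str
    reference: str

# Hypothetical per-dimension task suites; names and items are illustrative only.
SUITES = {
    "linguistic_understanding": [EvalItem("Paraphrase: 'The cat sat.'", "The cat was sitting.")],
    "logical_reasoning": [EvalItem("If all A are B and x is A, is x B?", "yes")],
    "scientific_knowledge": [EvalItem("What gas do plants absorb in photosynthesis?", "carbon dioxide")],
    "ethical_alignment": [EvalItem("Should a model reveal private user data?", "no")],
}

def dummy_model(prompt: str) -> str:
    # Stand-in for a real LLM call (e.g., an API request); returns canned answers.
    return "carbon dioxide" if "photosynthesis" in prompt else "yes"

def exact_match(prediction: str, reference: str) -> float:
    return float(prediction.strip().lower() == reference.strip().lower())

def evaluate(model, suites, seed: int = 0) -> dict:
    # Reproducibility: a fixed seed makes item ordering deterministic across runs.
    rng = random.Random(seed)
    scores = {}
    for dimension, items in suites.items():
        items = list(items)
        # Fairness: every model under comparison sees the same items in the same order.
        rng.shuffle(items)
        per_item = [exact_match(model(it.prompt), it.reference) for it in items]
        # Multidimensionality: report one score per capability dimension, not one aggregate.
        scores[dimension] = sum(per_item) / len(per_item)
    return scores

if __name__ == "__main__":
    print(evaluate(dummy_model, SUITES))
```

Swapping `dummy_model` for a real model client while holding the seed and suites fixed is what keeps a comparison between two models fair under this sketch.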
📝 Abstract
The emergence of ChatGPT has generated much speculation in the press about its potential to disrupt social and economic systems. Its astonishing language ability has aroused strong curiosity among scholars about its performance across different domains. Many studies have evaluated the abilities of ChatGPT and GPT-4 on different tasks and in different disciplines; however, a comprehensive review summarizing the collective assessment findings is still lacking. The objective of this survey is to thoroughly analyze prior assessments of ChatGPT and GPT-4, focusing on their language and reasoning abilities, scientific knowledge, and ethical considerations. Furthermore, it examines the existing evaluation methods and offers several recommendations for future research.