🤖 AI Summary
This study investigates whether current AI systems can emulate human behavior in language and vision tasks closely enough to pass a Turing-test-style discrimination task. To this end, we introduce the first large-scale cross-modal Turing-test benchmark, spanning six tasks: image captioning, word association, conversation, object detection, color estimation, and attention prediction. Our evaluation employs a double-blind, randomized design: responses from 549 human agents and 26 AI agents were assessed by 1,126 human judges and 10 AI judges across 25,650 Turing-like trials. Key contributions include: (1) the first systematic joint visual-linguistic Turing test; (2) empirical evidence that anthropomorphism correlates only weakly with conventional metrics (e.g., BLEU, mAP); (3) the finding that lightweight AI discriminators achieve significantly higher accuracy (62–71%) than human judges (52–65%, i.e., error rates of 35–48%); (4) formalization of "anthropomorphism" as an independent evaluation dimension; and (5) open-sourcing of the benchmark datasets and a standardized evaluation protocol.
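To make the discrimination protocol concrete, below is a minimal sketch of how a blinded two-alternative Turing trial and judge accuracy could be scored, and how per-model deception rates could be rank-correlated with a conventional metric such as BLEU. The trial fields, the toy data, and the choice of Spearman correlation are illustrative assumptions, not the paper's released evaluation code.

```python
import random
from statistics import mean

from scipy.stats import spearmanr  # rank correlation; an illustrative choice

# Hypothetical trial format: each trial pairs one human answer with one
# AI answer to the same stimulus (e.g., the same image to caption).
trials = [
    {"human_answer": "a soggy dog shaking itself off by the shore",
     "ai_answer": "a wet dog standing on a beach"},
    {"human_answer": "rush hour, everyone glued to their phones",
     "ai_answer": "people waiting at a crowded train platform"},
]

def run_trial(judge, trial, rng):
    """Blinded two-alternative trial: shuffle the pair, ask the judge
    which answer is human, and record whether the judge was right."""
    pair = [("human", trial["human_answer"]), ("ai", trial["ai_answer"])]
    rng.shuffle(pair)
    picked = judge([text for _, text in pair])  # judge returns index 0 or 1
    return pair[picked][0] == "human"

def judge_accuracy(judge, trials, seed=0):
    rng = random.Random(seed)
    return mean(run_trial(judge, t, rng) for t in trials)

def chance_judge(answers):
    """Baseline judge that guesses at random (expected accuracy 0.5)."""
    return random.randrange(len(answers))

print(f"chance-judge accuracy: {judge_accuracy(chance_judge, trials):.2f}")

# A model's imitation score is how often it fools the judge:
# deception_rate = 1 - judge_accuracy on that model's trials.
# The weak link between imitation and standard metrics can then be
# checked with a rank correlation (numbers below are hypothetical):
deception_rates = [0.41, 0.28, 0.35]  # per-model
bleu_scores = [0.31, 0.33, 0.24]      # per-model
rho, p_value = spearmanr(deception_rates, bleu_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.2f})")
```

In this setup a judge is just a callable returning the index of the answer it believes is human, so human judges, heuristic baselines, and trained classifiers can all be scored under the same protocol.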
📝 Abstract
As AI algorithms increasingly participate in daily activities, it becomes critical to ascertain whether the agents we interact with are human or not. To address this question, we turn to the Turing test and systematically benchmark current AIs on their ability to imitate humans in three language tasks (image captioning, word association, and conversation) and three vision tasks (object detection, color estimation, and attention prediction). The experiments involved 549 human agents plus 26 AI agents for dataset creation, and 1,126 human judges plus 10 AI judges, across 25,650 Turing-like tests. The results reveal that current AIs are not far from being able to impersonate humans in complex language and vision challenges: while human judges were often deceived, simple AI judges outperformed them in distinguishing human answers from AI answers. Moreover, imitation-test results are only minimally correlated with standard AI performance metrics, so evaluating whether a machine can pass as human constitutes an important, independent test of AI algorithms. The curated, large-scale Turing datasets introduced here, together with their evaluation metrics, provide new benchmarks and insights for assessing whether an agent is human and underscore the relevance of rigorous, systematic, and quantitative imitation tests in these and other AI domains.
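The finding that simple AI judges outperform human judges suggests that AI outputs carry statistical signatures a lightweight classifier can pick up. Below is a minimal sketch of what such a judge could look like, assuming a supervised setup with answers labeled as human- or AI-generated; the toy data and the TF-IDF plus logistic-regression pipeline are stand-ins, not the judges used in the study.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy stand-in data; real answers would come from the released datasets.
human_answers = [
    "a soggy dog shaking itself off by the shore",
    "two kids cracking up on a rusty old swing set",
    "someone lost in a book on a packed subway car",
    "steam rising off coffee on a cold morning",
]
ai_answers = [
    "a wet dog standing on a beach near the water",
    "two children playing on a swing in a park",
    "a person reading a book on a crowded train",
    "a cup of coffee on a table in the morning",
]
texts = human_answers + ai_answers
labels = [1] * len(human_answers) + [0] * len(ai_answers)  # 1 = human

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels)

# Lightweight judge: word/bigram TF-IDF features + logistic regression.
judge = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
judge.fit(X_train, y_train)
print(f"AI-judge accuracy on held-out answers: {judge.score(X_test, y_test):.2f}")
```

Such a classifier could also be plugged into a paired trial protocol by returning the index of the answer with the higher predicted probability of being human.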