🤖 AI Summary
This study investigates whether current AI systems can emulate human behavior in language and vision tasks closely enough to pass a Turing-test-style discrimination task. To this end, we introduce the first large-scale cross-modal Turing-test benchmark, spanning six tasks: image captioning, word association, conversation, object detection, color estimation, and attention prediction. Our evaluation employs a double-blind, randomized design: responses from 549 human agents and 26 AI agents were assessed by 1,126 human judges and 10 AI judges across 25,650 Turing-like trials. Key contributions include: (1) the first systematic joint visual-linguistic Turing test; (2) empirical evidence that anthropomorphism correlates only weakly with conventional metrics (e.g., BLEU, mAP); (3) the finding that lightweight AI discriminators achieve significantly higher accuracy (62–71%) than human judges (52–65%, i.e., error rates of 35–48%); (4) formalization of "anthropomorphism" as an independent evaluation dimension; and (5) open-sourcing of the benchmark datasets and a standardized evaluation protocol.
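To make the discrimination protocol concrete, below is a minimal sketch of how a blinded two-alternative Turing trial and judge accuracy could be scored, and how per-model deception rates could be rank-correlated with a conventional metric such as BLEU. The trial fields, the toy data, and the choice of Spearman correlation are illustrative assumptions, not the paper's released evaluation code.

```python
import random
from statistics import mean

from scipy.stats import spearmanr  # rank correlation; an illustrative choice

# Hypothetical trial format: each trial pairs one human answer with one
# AI answer to the same stimulus (e.g., the same image to caption).
trials = [
    {"human_answer": "a soggy dog shaking itself off by the shore",
     "ai_answer": "a wet dog standing on a beach"},
    {"human_answer": "rush hour, everyone glued to their phones",
     "ai_answer": "people waiting at a crowded train platform"},
]

def run_trial(judge, trial, rng):
    """Blinded two-alternative trial: shuffle the pair, ask the judge
    which answer is human, and record whether the judge was right."""
    pair = [("human", trial["human_answer"]), ("ai", trial["ai_answer"])]
    rng.shuffle(pair)
    picked = judge([text for _, text in pair])  # judge returns index 0 or 1
    return pair[picked][0] == "human"

def judge_accuracy(judge, trials, seed=0):
    rng = random.Random(seed)
    return mean(run_trial(judge, t, rng) for t in trials)

def chance_judge(answers):
    """Baseline judge that guesses at random (expected accuracy 0.5)."""
    return random.randrange(len(answers))

print(f"chance-judge accuracy: {judge_accuracy(chance_judge, trials):.2f}")

# A model's imitation score is how often it fools the judge:
# deception_rate = 1 - judge_accuracy on that model's trials.
# The weak link between imitation and standard metrics can then be
# checked with a rank correlation (numbers below are hypothetical):
deception_rates = [0.41, 0.28, 0.35]  # per-model
bleu_scores = [0.31, 0.33, 0.24]      # per-model
rho, p_value = spearmanr(deception_rates, bleu_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.2f})")
```

In this setup a judge is just a callable returning the index of the answer it believes is human, so human judges, heuristic baselines, and trained classifiers can all be scored under the same protocol.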
📝 Abstract
As AI algorithms increasingly participate in daily activities, it becomes critical to ascertain whether the agents we interact with are human or not. To address this question, we turn to the Turing test and systematically benchmark current AIs on their ability to imitate humans in three language tasks (image captioning, word association, and conversation) and three vision tasks (object detection, color estimation, and attention prediction). The experiments involved 549 human agents plus 26 AI agents for dataset creation, and 1,126 human judges plus 10 AI judges, across 25,650 Turing-like tests. The results reveal that current AIs are not far from being able to impersonate humans in complex language and vision challenges: while human judges were often deceived, simple AI judges outperformed them in distinguishing human answers from AI answers. Moreover, imitation-test results are only minimally correlated with standard AI performance metrics, so evaluating whether a machine can pass as human constitutes an important, independent test of AI algorithms. The curated, large-scale Turing datasets introduced here, together with their evaluation metrics, provide new benchmarks and insights for assessing whether an agent is human and underscore the relevance of rigorous, systematic, and quantitative imitation tests in these and other AI domains.
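The finding that simple AI judges outperform human judges suggests that AI outputs carry statistical signatures a lightweight classifier can pick up. Below is a minimal sketch of what such a judge could look like, assuming a supervised setup with answers labeled as human- or AI-generated; the toy data and the TF-IDF plus logistic-regression pipeline are stand-ins, not the judges used in the study.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy stand-in data; real answers would come from the released datasets.
human_answers = [
    "a soggy dog shaking itself off by the shore",
    "two kids cracking up on a rusty old swing set",
    "someone lost in a book on a packed subway car",
    "steam rising off coffee on a cold morning",
]
ai_answers = [
    "a wet dog standing on a beach near the water",
    "two children playing on a swing in a park",
    "a person reading a book on a crowded train",
    "a cup of coffee on a table in the morning",
]
texts = human_answers + ai_answers
labels = [1] * len(human_answers) + [0] * len(ai_answers)  # 1 = human

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels)

# Lightweight judge: word/bigram TF-IDF features + logistic regression.
judge = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
judge.fit(X_train, y_train)
print(f"AI-judge accuracy on held-out answers: {judge.score(X_test, y_test):.2f}")
```

Such a classifier could also be plugged into a paired trial protocol by returning the index of the answer with the higher predicted probability of being human.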