🤖 AI Summary
This study rigorously evaluates whether GPT-4-Turbo can pass the Turing Test as Turing originally specified it in 1950, as a probe of claims about human-level cognition.
Method: For the first time, the authors fully reconstruct Turing’s three-player “imitation game”, including the Man-Imitates-Woman Game (MIWG) as a control condition, and conduct a rigorous Computer-Imitates-Human Game (CIHG): human judges interrogate hidden interlocutors in real-time text conversation without time constraints, followed by statistical significance analysis.
Contribution/Results: All but one judge correctly distinguished the LLM from the human participants, showing that even a state-of-the-art large language model fails a stringent Turing Test. The work offers the most historically faithful empirical application to date of Turing’s original proposal, rebuts overclaims about LLMs’ “thinking” capabilities, and sets a more rigorous standard for such evaluations.
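The significance analysis mentioned above can be illustrated with an exact one-sided binomial test: under the null hypothesis that judges guess at chance (p = 0.5), how likely is it that so many identify the machine correctly? The judge counts below are hypothetical placeholders (the paper reports only that all but one participant succeeded), chosen purely to show the calculation.

```python
from math import comb

def binomial_pvalue(k: int, n: int, p: float = 0.5) -> float:
    """Exact one-sided binomial test: P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical illustration: 9 of 10 judges identify the LLM correctly.
pval = binomial_pvalue(9, 10)
print(f"p = {pval:.4f}")  # p = 0.0107, below the conventional 0.05 threshold
```

A result like this would let one reject chance-level guessing, i.e. the judges could reliably tell the machine from the human.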
📝 Abstract
The current cycle of hype and anxiety concerning the benefits and risks to human society of Artificial Intelligence is fuelled, not only by the increasing use of generative AI and other AI tools by the general public, but also by claims made on behalf of such technology by popularizers and scientists. In particular, recent studies have claimed that Large Language Models (LLMs) can pass the Turing Test (a goal for AI since the 1950s) and therefore can "think". Large-scale impacts on society have been predicted as a result. Upon detailed examination, however, none of these studies has faithfully applied Turing's original instructions. Consequently, we conducted a rigorous Turing Test with GPT-4-Turbo that adhered closely to Turing's instructions for a three-player imitation game. We followed established scientific standards where Turing's instructions were ambiguous or missing. For example, we performed a Computer-Imitates-Human Game (CIHG) without constraining the time duration and conducted a Man-Imitates-Woman Game (MIWG) as a benchmark. All but one participant correctly identified the LLM, showing that one of today's most advanced LLMs is unable to pass a rigorous Turing Test. We conclude that recent extravagant claims for such models are unsupported, and do not warrant either optimism or concern about the social impact of thinking machines.