🤖 AI Summary
This study empirically examines whether large language models (LLMs) can pass a standard three-party Turing test. Method: Two preregistered, randomized, controlled trials measured how often human interrogators misattributed humanness (i.e., judged an LLM to be the human) during simultaneous 5-minute real-time conversations with one human participant and one of four systems: ELIZA, GPT-4o, LLaMa-3.1-405B, and GPT-4.5. GPT-4.5 and LLaMa-3.1-405B were prompted to adopt a humanlike persona; ELIZA and GPT-4o served as baselines. Contribution/Results: The study provides the first empirical evidence that an artificial system passes a standard three-party Turing test: with the persona prompt, GPT-4.5 was judged to be the human 73% of the time, significantly more often than interrogators selected the real human participant; LLaMa-3.1-405B reached 56%, statistically indistinguishable from the humans it was compared against; ELIZA (23%) and GPT-4o (21%) fell significantly below chance. These findings underscore the role of anthropomorphic prompting in shaping perceived humanness and bear on debates about the kind of intelligence LLMs exhibit and the social and economic impacts these systems are likely to have.
📝 Abstract
We evaluated four systems (ELIZA, GPT-4o, LLaMa-3.1-405B, and GPT-4.5) in two randomised, controlled, and pre-registered Turing tests on independent populations. Participants had 5-minute conversations simultaneously with another human participant and one of these systems before judging which conversational partner they thought was human. When prompted to adopt a humanlike persona, GPT-4.5 was judged to be the human 73% of the time: significantly more often than interrogators selected the real human participant. LLaMa-3.1, with the same prompt, was judged to be the human 56% of the time (not significantly more or less often than the humans it was compared against), while baseline models (ELIZA and GPT-4o) achieved win rates significantly below chance (23% and 21%, respectively). The results constitute the first empirical evidence that any artificial system passes a standard three-party Turing test. These findings have implications for debates about what kind of intelligence is exhibited by Large Language Models (LLMs), and about the social and economic impacts these systems are likely to have.
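The statistical claims above amount to testing each system's win rate against the 50% chance level of a binary human-vs-AI judgment. The following is a minimal sketch of that kind of analysis using an exact two-sided binomial test; the trial counts are illustrative placeholders, not the study's actual sample sizes, and the paper's own analysis may differ.

```python
# Sketch: exact two-sided binomial test of a win rate against chance (p = 0.5).
# Trial counts below are illustrative only, NOT the study's real sample sizes.
from math import comb

def binom_two_sided_p(wins: int, trials: int, p: float = 0.5) -> float:
    """Two-sided exact binomial p-value: sum the probabilities of all
    outcomes no more likely than the observed count of wins."""
    probs = [comb(trials, k) * p**k * (1 - p) ** (trials - k)
             for k in range(trials + 1)]
    observed = probs[wins]
    # Small tolerance guards against float round-off when comparing tail terms.
    return min(1.0, sum(q for q in probs if q <= observed + 1e-12))

# Illustrative: 73 "judged human" verdicts in 100 trials (GPT-4.5's rate)
# yields a p-value far below 0.05, i.e., significantly above chance.
p_gpt45 = binom_two_sided_p(73, 100)

# Illustrative: 56/100 (LLaMa-3.1's rate) is not distinguishable from chance.
p_llama = binom_two_sided_p(56, 100)

# Illustrative: 23/100 (ELIZA's rate) is significantly BELOW chance.
p_eliza = binom_two_sided_p(23, 100)

print(p_gpt45, p_llama, p_eliza)
```

Because the null hypothesis is symmetric (p = 0.5), summing all outcomes at most as probable as the observed one is equivalent to doubling the relevant tail probability, which is why 73/100 and 23/100 both come out significant while 56/100 does not.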