AI Summary
This work addresses a limitation of the traditional Turing test, which is confined to one-on-one interactions and is thus ill-suited to evaluating the human-likeness of large language models (LLMs) in group settings. The authors propose "TuringHotel", a distributed Turing test framework that facilitates time-limited discussions within mixed groups of humans and AI agents. The approach introduces the first symmetric, decentralized paradigm for group interaction, in which participants simultaneously serve as both judges and subjects. Built on the UNaIVERSE platform, the system provides authenticated peer-to-peer communication, cross-device interface consistency, and programmable role logic, supporting scalable, longitudinal evaluation. In experiments involving 17 human participants and 19 LLM instances, some AI agents were misidentified as human; however, interaction fingerprints distinctive of humans remained discernible, indicating that current models have not yet fully replicated human-like group behavior.
Abstract
In this paper, we report our experience with "TuringHotel", a novel extension of the Turing Test based on interactions within mixed communities of Large Language Models (LLMs) and human participants. The classical one-to-one interaction of the Turing Test is reinterpreted in a group setting, where human and artificial agents engage in time-bounded discussions and, interestingly, act as both judges and respondents. This community is instantiated on the novel platform UNaIVERSE (https://unaiverse.io) as a "World" that defines the roles and interaction dynamics, built with the platform's programming tools. All communication occurs over an authenticated peer-to-peer network, ensuring that no third party can access the exchanges. The platform also provides a unified interface for humans, accessible from both mobile devices and laptops, which was a key component of the experience reported in this paper. Results of our experiments, involving 17 human participants and 19 LLMs, reveal that current models are still sometimes mistaken for humans. Interestingly, several unexpected misidentifications suggest that human fingerprints are still identifiable, though not fully unambiguous, despite the high-quality language skills of the artificial participants. We argue that this is the first experiment conducted in such a distributed setting, and that similar initiatives could be of national interest, supporting ongoing experiments and competitions aimed at monitoring the evolution of large language models over time.