Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)

📅 2025-10-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper identifies a pervasive "Artificial Hivemind" phenomenon in large language models (LLMs) during open-ended generation: outputs across diverse models converge heavily, severely undermining response diversity and potentially fostering human cognitive homogenization. To address this, the authors propose the first taxonomy for diversity evaluation under open-ended prompts and introduce Infinity-Chat, a large-scale, human-annotated benchmark comprising 26K real-world queries, 17 fine-grained task categories, and 31,250 annotations spanning absolute ratings and pairwise comparisons. Empirical analysis reveals pronounced intra-model repetition and inter-model homogeneity: generations are strikingly similar across architectures, and models struggle to capture individualized human preferences. The work contributes a novel methodological framework, the first comprehensive diversity benchmark for open-ended generation, and a timely warning about preference misalignment, safety implications, and the societal impact of low-diversity AI systems.
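The inter-model convergence described above can be quantified by comparing responses pairwise. The sketch below is an illustrative, stdlib-only approximation (lexical similarity via `difflib`, not the paper's actual metric); the `outputs` list and the function name are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def mean_pairwise_similarity(responses: list[str]) -> float:
    """Average lexical similarity over all unordered pairs of responses.

    Values near 1.0 suggest near-duplicate outputs (homogeneity);
    values near 0.0 suggest lexically diverse outputs.
    """
    pairs = list(combinations(responses, 2))
    if not pairs:
        return 0.0
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)


# Hypothetical outputs from three different models for one open-ended prompt.
outputs = [
    "A good morning routine starts with hydration and light stretching.",
    "A good morning routine starts with hydration and gentle stretching.",
    "Begin your day by drinking water and doing light stretches.",
]
print(mean_pairwise_similarity(outputs))
```

Applied across many prompts and many models, a score like this (or an embedding-based analogue) would surface the homogeneity effect the paper measures at scale.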

📝 Abstract
Language models (LMs) often struggle to generate diverse, human-like creative content, raising concerns about the long-term homogenization of human thought through repeated exposure to similar outputs. Yet scalable methods for evaluating LM output diversity remain limited, especially beyond narrow tasks such as random number or name generation, or beyond repeated sampling from a single model. We introduce Infinity-Chat, a large-scale dataset of 26K diverse, real-world, open-ended user queries that admit a wide range of plausible answers with no single ground truth. We introduce the first comprehensive taxonomy for characterizing the full spectrum of open-ended prompts posed to LMs, comprising 6 top-level categories (e.g., brainstorm & ideation) that further break down into 17 subcategories. Using Infinity-Chat, we present a large-scale study of mode collapse in LMs, revealing a pronounced Artificial Hivemind effect in open-ended generation, characterized by (1) intra-model repetition, where a single model consistently generates similar responses, and more so (2) inter-model homogeneity, where different models produce strikingly similar outputs. Infinity-Chat also includes 31,250 human annotations, across absolute ratings and pairwise preferences, with 25 independent human annotations per example. This enables studying collective and individual-specific human preferences in response to open-ended queries. Our findings show that LMs, reward models, and LM judges are less well calibrated to human ratings on model generations that elicit differing idiosyncratic annotator preferences, despite maintaining comparable overall quality. Overall, Infinity-Chat presents the first large-scale resource for systematically studying real-world open-ended queries to LMs, revealing critical insights to guide future research on mitigating the long-term AI safety risks posed by the Artificial Hivemind.
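With 25 independent ratings per example, responses that "elicit differing idiosyncratic annotator preferences" can be flagged by how much annotators disagree. A minimal sketch, assuming 1-5 absolute ratings (the rating lists and function name here are hypothetical, not the paper's protocol):

```python
from statistics import stdev


def rating_dispersion(ratings: list[float]) -> float:
    """Standard deviation of per-annotator ratings for one response.

    High dispersion flags responses with divergent (idiosyncratic)
    preferences; the paper reports that LMs, reward models, and LM
    judges are less well calibrated on exactly these cases.
    """
    return stdev(ratings) if len(ratings) > 1 else 0.0


# Hypothetical 1-5 ratings from five annotators for two responses.
consensus = [4, 4, 5, 4, 4]        # annotators broadly agree
idiosyncratic = [1, 5, 2, 5, 1]    # annotators split sharply

print(rating_dispersion(consensus))
print(rating_dispersion(idiosyncratic))
```

Note that both responses can have similar mean ratings (comparable overall quality) while differing sharply in dispersion, which is why disagreement, not average score, isolates the hard-to-calibrate cases.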
Problem

Research questions and friction points this paper is trying to address.

Evaluating LM output diversity beyond narrow tasks
Characterizing open-ended prompts with a comprehensive taxonomy
Studying mode collapse and inter-model homogeneity effects
Innovation

Methods, ideas, or system contributions that make the work stand out.

Infinity-Chat dataset with 26K diverse user queries
Comprehensive taxonomy for open-ended LM prompts
Large-scale study revealing Artificial Hivemind effect