Arbiters of Ambivalence: Challenges of Using LLMs in No-Consensus Tasks

📅 2025-05-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates alignment limitations of large language models (LLMs) when serving as answer generators, evaluators, and debaters on human-disputed topics—such as ethical dilemmas and value trade-offs—where no societal consensus exists. It challenges prevailing preference alignment paradigms that implicitly assume human agreement. Method: We introduce the first “no-consensus” benchmark, comprising manually annotated ambiguous scenarios, and conduct multi-model comparative experiments (GPT-4, Claude, Llama) with quantitative stance consistency measurement and qualitative open-response analysis. Contribution/Results: While LLMs generate nuanced answers as generators, they exhibit strong binary polarization as evaluators and debaters—averaging 76% stance rigidity, significantly exceeding observed human disagreement levels. We systematically demonstrate their failure to model human value pluralism, propose a novel three-role evaluation framework, and critically question foundational assumptions underlying scalable oversight and preference-based alignment.
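The "76% stance rigidity" figure suggests a simple per-scenario statistic over sampled responses. Below is a minimal, hypothetical sketch of how a stance-rigidity score and a polarization score could be computed from stance labels; the function names and exact metric definitions are illustrative assumptions, not the paper's published measurement.

```python
# Hypothetical stance-rigidity metric for two-stance scenarios.
# Assumed labels: "A" / "B" (committed stance) or "neutral" (hedged answer).
from collections import Counter

def stance_rigidity(labels: list[str]) -> float:
    """Fraction of responses that commit to one of the two stances."""
    if not labels:
        return 0.0
    committed = sum(1 for s in labels if s in ("A", "B"))
    return committed / len(labels)

def polarization(labels: list[str]) -> float:
    """How lopsided the committed responses are: 1.0 = unanimous, 0.0 = even split."""
    counts = Counter(s for s in labels if s in ("A", "B"))
    total = sum(counts.values())
    return abs(counts.get("A", 0) - counts.get("B", 0)) / total if total else 0.0

# Example: 7 of 8 sampled judge verdicts pick a side -> rigidity 0.875
print(stance_rigidity(["A", "A", "B", "A", "neutral", "A", "B", "A"]))
```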

📝 Abstract
The increasing use of LLMs as substitutes for humans in "aligning" LLMs has raised questions about their ability to replicate human judgments and preferences, especially in ambivalent scenarios where humans disagree. This study examines the biases and limitations of LLMs in three roles: answer generator, judge, and debater. These roles loosely correspond to previously described alignment frameworks: preference alignment (judge) and scalable oversight (debater), with the answer generator reflecting the typical setting with user interactions. We develop a "no-consensus" benchmark by curating examples that encompass a variety of a priori ambivalent scenarios, each presenting two possible stances. Our results show that while LLMs can provide nuanced assessments when generating open-ended answers, they tend to take a stance on no-consensus topics when employed as judges or debaters. These findings underscore the necessity for more sophisticated methods for aligning LLMs without human oversight, highlighting that LLMs cannot fully capture human disagreement even on topics where humans themselves are divided.
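For concreteness, a single benchmark item as described in the abstract (an ambivalent scenario with two possible stances) might look like the record below; the field names and example scenario are illustrative assumptions, not the paper's released data format.

```python
# Illustrative shape of one "no-consensus" benchmark item (assumed, not the released format).
example_item = {
    "scenario": (
        "A city must decide whether to demolish a historic building "
        "to make room for affordable housing."
    ),
    "stance_a": "Preserve the historic building.",
    "stance_b": "Build the affordable housing.",
    "human_consensus": None,  # by construction, humans are divided on these items
}
```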
Problem

Research questions and friction points this paper is trying to address.

LLMs' ability to replicate human judgments in ambivalent scenarios
Biases and limitations of LLMs as answer generators, judges, and debaters
Need for better alignment methods to capture human disagreement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed a no-consensus benchmark for ambivalent scenarios
Evaluated LLMs in three roles: answer generator, judge, and debater (a minimal sketch of this setup follows the list)
Showed that LLMs take firm stances on no-consensus topics when acting as judges or debaters, a bias that surfaces in settings without human oversight
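A hedged sketch of the three-role evaluation setup referenced above: `query_llm` is a placeholder for whatever chat-completion client is used, and the prompts are illustrative assumptions rather than the paper's actual templates.

```python
# Hedged sketch of the three-role setup (generator / judge / debater).
# `query_llm` is a placeholder for a model client; prompts are assumed for illustration.
def query_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model client here")

def run_three_roles(scenario: str, stance_a: str, stance_b: str) -> dict[str, str]:
    # Generator: open-ended answer, where the paper reports nuanced responses.
    generator = query_llm(
        f"Scenario: {scenario}\nGive your view in an open-ended answer."
    )
    # Judge: forced comparison between the two stances (preference-alignment style).
    judge = query_llm(
        f"Scenario: {scenario}\nWhich stance is better?\n"
        f"A) {stance_a}\nB) {stance_b}\nAnswer with A or B."
    )
    # Debater: argue one side as persuasively as possible (scalable-oversight style).
    debater = query_llm(
        f"Scenario: {scenario}\nArgue as persuasively as possible for: {stance_a}"
    )
    return {"generator": generator, "judge": judge, "debater": debater}
```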