Language Model Council: Democratically Benchmarking Foundation Models on Highly Subjective Tasks

📅 2024-06-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address intra-model bias in large language model (LLM) evaluation of highly subjective tasks (such as emotional intelligence and creative writing), which arises from reliance on a single adjudicating model, this paper proposes a democratized, multi-LLM collaborative evaluation paradigm. The authors convene an inclusive council of 20 heterogeneous LLMs that jointly perform question generation, response generation, and open-ended peer review. Robustness and cost are studied via Monte Carlo sampling of random sub-councils, and results are validated against human annotations. By replacing individual-model judgment with collective consensus, the approach produces rankings on emotional intelligence tasks that are more separable and more robust. Empirical results demonstrate that the proposed LMC (Language Model Council) framework achieves substantially higher agreement with human judgments than single-model baselines, including GPT-4o, thereby advancing principled, scalable, and human-aligned evaluation of subjective LLM capabilities.
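The pipeline described above (question generation, response generation, open-ended peer review, consensus aggregation) can be pictured with a minimal sketch. This is not the authors' code: `query_model`, the pairwise A/B prompt, and the win-rate aggregation are illustrative assumptions, question generation is omitted (a fixed scenario string is assumed), and the three model names merely stand in for the 20-member council.

```python
from itertools import permutations
from collections import defaultdict

COUNCIL = ["gpt-4o", "claude-3-opus", "llama-3-70b"]  # stand-ins; the paper convenes 20 LLMs

def query_model(model: str, prompt: str) -> str:
    """Hypothetical helper: send `prompt` to the named LLM and return its text reply."""
    raise NotImplementedError("plug in your own LLM API client here")

def run_council(scenario: str) -> dict[str, float]:
    # 1) Response generation: every member answers the same open-ended conflict scenario.
    responses = {m: query_model(m, scenario) for m in COUNCIL}

    # 2) Peer review: every member judges every ordered pair of other members' responses
    #    (both orders, to dampen position bias); no member judges its own response.
    wins = defaultdict(float)
    for judge in COUNCIL:
        for a, b in permutations(COUNCIL, 2):
            if judge in (a, b):
                continue
            verdict = query_model(
                judge,
                f"Scenario: {scenario}\n"
                f"Response A: {responses[a]}\n"
                f"Response B: {responses[b]}\n"
                "Which response shows better emotional intelligence? Answer A or B.",
            )
            wins[a if verdict.strip().upper().startswith("A") else b] += 1

    # 3) Democratic aggregation: each member's win rate over all duels it appeared in.
    duels_per_member = 2 * (len(COUNCIL) - 1) * (len(COUNCIL) - 2)
    return {m: wins[m] / duels_per_member for m in COUNCIL}
```

Sorting the returned win rates gives the council's ranking; the authors' exact prompts, rating rubric, and aggregation scheme differ from this toy version.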

📝 Abstract
As Large Language Models (LLMs) continue to evolve, evaluating them remains a persistent challenge. Many recent evaluations use LLMs as judges to score outputs from other LLMs, often relying on a single large model like GPT-4o. However, using a single LLM judge is prone to intra-model bias, and many tasks - such as those related to emotional intelligence, creative writing, and persuasiveness - may be too subjective for a single model to judge fairly. We introduce the Language Model Council (LMC), where a group of LLMs collaborate to create tests, respond to them, and evaluate each other's responses to produce a ranking in a democratic fashion. Unlike previous approaches that focus on reducing cost or bias by using a panel of smaller models, our work examines the benefits and nuances of a fully inclusive LLM evaluation system. In a detailed case study on emotional intelligence, we deploy a council of 20 recent LLMs to rank each other on open-ended responses to interpersonal conflicts. Our results show that the LMC produces rankings that are more separable and more robust, and through a user study, we show that they are more consistent with human evaluations than any individual LLM judge. Using all LLMs for judging can be costly, however, so we use Monte Carlo simulations and hand-curated sub-councils to study hypothetical council compositions and discuss the value of the incremental LLM judge.
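The abstract's Monte Carlo study of hypothetical council compositions can be approximated as follows, assuming per-judge scores ("ballots") have already been collected. This is a sketch under assumptions: the ballot format, the mean-score aggregation, and Kendall's tau as the agreement metric are illustrative choices, not necessarily the authors' setup.

```python
import random
import numpy as np
from scipy.stats import kendalltau

def council_ranking(ballots: dict[str, dict[str, float]], judges: list[str]) -> list[str]:
    """Rank contestant models by their mean score across the chosen judges, best first."""
    contestants = list(next(iter(ballots.values())))
    mean_score = {c: float(np.mean([ballots[j][c] for j in judges])) for c in contestants}
    return sorted(contestants, key=mean_score.get, reverse=True)

def subcouncil_agreement(ballots: dict[str, dict[str, float]],
                         size: int, n_samples: int = 1000, seed: int = 0) -> float:
    """Mean Kendall's tau between random sub-councils of `size` judges and the full council."""
    rng = random.Random(seed)
    judges = list(ballots)
    full = council_ranking(ballots, judges)
    taus = []
    for _ in range(n_samples):
        sub_rank = council_ranking(ballots, rng.sample(judges, size))
        sub_pos = {c: i for i, c in enumerate(sub_rank)}
        tau, _ = kendalltau(range(len(full)), [sub_pos[c] for c in full])
        taus.append(tau)
    return float(np.mean(taus))

# Toy usage: ballots map judge -> contestant -> score in [0, 1].
rng = random.Random(42)
ballots = {f"judge_{i}": {f"model_{k}": rng.random() for k in range(5)} for i in range(8)}
print(subcouncil_agreement(ballots, size=3))
```

Sweeping `size` from one judge up to the full council traces how much agreement each additional judge buys, which is the question the abstract raises about the value of the incremental LLM judge.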
Problem

Research questions and friction points this paper is trying to address.

Addressing intra-model bias in LLM evaluations
Evaluating subjective tasks like emotional intelligence
Implementing democratic LLM collaboration for fair rankings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fully inclusive, democratic LLM evaluation system
Monte Carlo simulations of sub-council compositions to manage judging cost
Hand-curated sub-councils for studying hypothetical council compositions
Justin Zhao
Independent, Ex-Google, Ex-Predibase
F. Plaza-del-Arco
Bocconi University
Benjie Genchel
A. C. Curry
Bocconi University