Language Model Council: Democratically Benchmarking Foundation Models on Highly Subjective Tasks

📅 2024-06-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address intra-model bias in large language model (LLM) evaluation of highly subjective tasks (such as emotional intelligence and creative writing), which arises from reliance on a single adjudicating model, this paper proposes a democratized, multi-LLM collaborative evaluation paradigm. The authors convene an inclusive council of 20 heterogeneous LLMs that jointly perform question generation, response generation, and open-ended peer review. Robustness and cost are studied via Monte Carlo sampling of random sub-councils, and results are validated against human annotations. By replacing individual-model judgment with collective consensus, the approach produces rankings on emotional intelligence tasks that are more separable and more robust. Empirical results demonstrate that the proposed LMC (Language Model Council) framework achieves substantially higher agreement with human judgments than single-model baselines, including GPT-4o, thereby advancing principled, scalable, and human-aligned evaluation of subjective LLM capabilities.
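The pipeline described above (question generation, response generation, open-ended peer review, consensus aggregation) can be pictured with a minimal sketch. This is not the authors' code: `query_model`, the pairwise A/B prompt, and the win-rate aggregation are illustrative assumptions, question generation is omitted (a fixed scenario string is assumed), and the three model names merely stand in for the 20-member council.

```python
from itertools import permutations
from collections import defaultdict

COUNCIL = ["gpt-4o", "claude-3-opus", "llama-3-70b"]  # stand-ins; the paper convenes 20 LLMs

def query_model(model: str, prompt: str) -> str:
    """Hypothetical helper: send `prompt` to the named LLM and return its text reply."""
    raise NotImplementedError("plug in your own LLM API client here")

def run_council(scenario: str) -> dict[str, float]:
    # 1) Response generation: every member answers the same open-ended conflict scenario.
    responses = {m: query_model(m, scenario) for m in COUNCIL}

    # 2) Peer review: every member judges every ordered pair of other members' responses
    #    (both orders, to dampen position bias); no member judges its own response.
    wins = defaultdict(float)
    for judge in COUNCIL:
        for a, b in permutations(COUNCIL, 2):
            if judge in (a, b):
                continue
            verdict = query_model(
                judge,
                f"Scenario: {scenario}\n"
                f"Response A: {responses[a]}\n"
                f"Response B: {responses[b]}\n"
                "Which response shows better emotional intelligence? Answer A or B.",
            )
            wins[a if verdict.strip().upper().startswith("A") else b] += 1

    # 3) Democratic aggregation: each member's win rate over all duels it appeared in.
    duels_per_member = 2 * (len(COUNCIL) - 1) * (len(COUNCIL) - 2)
    return {m: wins[m] / duels_per_member for m in COUNCIL}
```

Sorting the returned win rates gives the council's ranking; the authors' exact prompts, rating rubric, and aggregation scheme differ from this toy version.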

📝 Abstract
As Large Language Models (LLMs) continue to evolve, evaluating them remains a persistent challenge. Many recent evaluations use LLMs as judges to score outputs from other LLMs, often relying on a single large model like GPT-4o. However, using a single LLM judge is prone to intra-model bias, and many tasks - such as those related to emotional intelligence, creative writing, and persuasiveness - may be too subjective for a single model to judge fairly. We introduce the Language Model Council (LMC), where a group of LLMs collaborate to create tests, respond to them, and evaluate each other's responses to produce a ranking in a democratic fashion. Unlike previous approaches that focus on reducing cost or bias by using a panel of smaller models, our work examines the benefits and nuances of a fully inclusive LLM evaluation system. In a detailed case study on emotional intelligence, we deploy a council of 20 recent LLMs to rank each other on open-ended responses to interpersonal conflicts. Our results show that the LMC produces rankings that are more separable and more robust, and through a user study, we show that they are more consistent with human evaluations than any individual LLM judge. Using all LLMs for judging can be costly, however, so we use Monte Carlo simulations and hand-curated sub-councils to study hypothetical council compositions and discuss the value of the incremental LLM judge.
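The abstract's Monte Carlo study of hypothetical council compositions can be approximated as follows, assuming per-judge scores ("ballots") have already been collected. This is a sketch under assumptions: the ballot format, the mean-score aggregation, and Kendall's tau as the agreement metric are illustrative choices, not necessarily the authors' setup.

```python
import random
import numpy as np
from scipy.stats import kendalltau

def council_ranking(ballots: dict[str, dict[str, float]], judges: list[str]) -> list[str]:
    """Rank contestant models by their mean score across the chosen judges, best first."""
    contestants = list(next(iter(ballots.values())))
    mean_score = {c: float(np.mean([ballots[j][c] for j in judges])) for c in contestants}
    return sorted(contestants, key=mean_score.get, reverse=True)

def subcouncil_agreement(ballots: dict[str, dict[str, float]],
                         size: int, n_samples: int = 1000, seed: int = 0) -> float:
    """Mean Kendall's tau between random sub-councils of `size` judges and the full council."""
    rng = random.Random(seed)
    judges = list(ballots)
    full = council_ranking(ballots, judges)
    taus = []
    for _ in range(n_samples):
        sub_rank = council_ranking(ballots, rng.sample(judges, size))
        sub_pos = {c: i for i, c in enumerate(sub_rank)}
        tau, _ = kendalltau(range(len(full)), [sub_pos[c] for c in full])
        taus.append(tau)
    return float(np.mean(taus))

# Toy usage: ballots map judge -> contestant -> score in [0, 1].
rng = random.Random(42)
ballots = {f"judge_{i}": {f"model_{k}": rng.random() for k in range(5)} for i in range(8)}
print(subcouncil_agreement(ballots, size=3))
```

Sweeping `size` from one judge up to the full council traces how much agreement each additional judge buys, which is the question the abstract raises about the value of the incremental LLM judge.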
Problem

Research questions and friction points this paper is trying to address.

Addressing intra-model bias in LLM evaluations
Evaluating subjective tasks like emotional intelligence
Implementing democratic LLM collaboration for fair rankings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fully inclusive, democratic LLM evaluation system
Monte Carlo simulations of sub-council compositions to manage judging cost
Hand-curated sub-councils for studying hypothetical council compositions
Justin Zhao
Independent, Ex-Google, Ex-Predibase
F. Plaza-del-Arco
Bocconi University
Benjie Genchel
A. C. Curry
Bocconi University