Diverse LLMs or Diverse Question Interpretations? That is the Ensembling Question

📅 2025-07-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates how diversity enhances the accuracy of large language models (LLMs) on binary question-answering tasks. We systematically compare two diversity strategies: *model diversity*—ensemble voting across multiple distinct LLMs—and *question interpretation diversity*—generating multiple paraphrased reformulations of the same question to elicit varied responses from a single model. Using majority-voting ensembles, we evaluate GPT and LLaMA variants on BoolQ, StrategyQA, and PubMedQA. Results show that question interpretation diversity consistently and significantly improves ensemble accuracy, whereas model diversity yields no stable performance gain. This challenges the conventional multi-model ensemble paradigm and reveals that prompting-induced internal representation diversity within a single LLM is more effective than architectural diversity across models. The findings offer a lightweight, parameter-efficient approach to enhancing LLM robustness through prompt engineering rather than model scaling or ensemble expansion.

📝 Abstract
Effectively leveraging diversity has been shown to improve performance for various machine learning models, including large language models (LLMs). However, determining the most effective way of using diversity remains a challenge. In this work, we compare two diversity approaches for answering binary questions using LLMs: model diversity, which relies on multiple models answering the same question, and question interpretation diversity, which relies on using the same model to answer the same question framed in different ways. For both cases, we apply majority voting as the ensemble consensus heuristic to determine the final answer. Our experiments on BoolQ, StrategyQA, and PubMedQA show that question interpretation diversity consistently leads to better ensemble accuracy compared to model diversity. Furthermore, our analysis of GPT and LLaMA shows that model diversity typically produces results between the best and the worst ensemble members without clear improvement.
Problem

Research questions and friction points this paper is trying to address.

Comparing model diversity vs question interpretation diversity in LLMs
Evaluating ensemble accuracy using majority voting for binary questions
Analyzing performance of GPT and LLaMA with diverse ensembling approaches
Innovation

Methods, ideas, or system contributions that make the work stand out.

Compares model diversity and question interpretation diversity
Uses majority voting for ensemble consensus heuristic
Question interpretation diversity improves ensemble accuracy
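The majority-voting consensus heuristic used in both diversity settings can be sketched in a few lines. The snippet below is an illustrative example, not the paper's code: the answer lists are hypothetical, and with binary questions an odd ensemble size avoids ties.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer among ensemble members."""
    return Counter(answers).most_common(1)[0][0]

# Question interpretation diversity: one model answers k paraphrases
# of the same binary question (hypothetical answers).
paraphrase_answers = ["yes", "yes", "no", "yes", "yes"]

# Model diversity: k distinct models answer the same question
# (hypothetical answers).
model_answers = ["yes", "no", "no", "yes", "no"]

print(majority_vote(paraphrase_answers))  # yes
print(majority_vote(model_answers))       # no
```

Either ensemble reduces to the same consensus rule; what differs is only where the diversity in `answers` comes from, multiple phrasings of one question versus multiple models.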