Diverse LLMs or Diverse Question Interpretations? That is the Ensembling Question

📅 2025-07-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates how diversity enhances the accuracy of large language models (LLMs) on binary question-answering tasks. We systematically compare two diversity strategies: *model diversity*—ensemble voting across multiple distinct LLMs—and *question interpretation diversity*—generating multiple paraphrased reformulations of the same question to elicit varied responses from a single model. Using majority-voting ensembles, we evaluate GPT and LLaMA variants on BoolQ, StrategyQA, and PubMedQA. Results show that question interpretation diversity consistently and significantly improves ensemble accuracy, whereas model diversity yields no stable performance gain. This challenges the conventional multi-model ensemble paradigm and reveals that prompting-induced internal representation diversity within a single LLM is more effective than architectural diversity across models. The findings offer a lightweight, parameter-efficient approach to enhancing LLM robustness through prompt engineering rather than model scaling or ensemble expansion.

📝 Abstract
Effectively leveraging diversity has been shown to improve performance for various machine learning models, including large language models (LLMs). However, determining the most effective way of using diversity remains a challenge. In this work, we compare two diversity approaches for answering binary questions using LLMs: model diversity, which relies on multiple models answering the same question, and question interpretation diversity, which relies on using the same model to answer the same question framed in different ways. For both cases, we apply majority voting as the ensemble consensus heuristic to determine the final answer. Our experiments on BoolQ, StrategyQA, and PubMedQA show that question interpretation diversity consistently leads to better ensemble accuracy compared to model diversity. Furthermore, our analysis of GPT and LLaMA shows that model diversity typically produces results between the best and the worst ensemble members without clear improvement.
Problem

Research questions and friction points this paper is trying to address.

Comparing model diversity vs question interpretation diversity in LLMs
Evaluating ensemble accuracy using majority voting for binary questions
Analyzing performance of GPT and LLaMA with diverse ensembling approaches
Innovation

Methods, ideas, or system contributions that make the work stand out.

Compares model diversity and question interpretation diversity
Uses majority voting for ensemble consensus heuristic
Question interpretation diversity improves ensemble accuracy
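The majority-voting consensus heuristic used in both diversity settings can be sketched in a few lines. The snippet below is an illustrative example, not the paper's code: the answer lists are hypothetical, and with binary questions an odd ensemble size avoids ties.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer among ensemble members."""
    return Counter(answers).most_common(1)[0][0]

# Question interpretation diversity: one model answers k paraphrases
# of the same binary question (hypothetical answers).
paraphrase_answers = ["yes", "yes", "no", "yes", "yes"]

# Model diversity: k distinct models answer the same question
# (hypothetical answers).
model_answers = ["yes", "no", "no", "yes", "no"]

print(majority_vote(paraphrase_answers))  # yes
print(majority_vote(model_answers))       # no
```

Either ensemble reduces to the same consensus rule; what differs is only where the diversity in `answers` comes from, multiple phrasings of one question versus multiple models.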