🤖 AI Summary
This study investigates how diversity enhances the accuracy of large language models (LLMs) on binary question-answering tasks. We systematically compare two diversity strategies: *model diversity* (ensemble voting across multiple distinct LLMs) and *question interpretation diversity* (generating multiple paraphrased reformulations of the same question to elicit varied responses from a single model). Using majority-voting ensembles, we evaluate GPT and LLaMA variants on BoolQ, StrategyQA, and PubMedQA. Results show that question interpretation diversity consistently improves ensemble accuracy, whereas model diversity yields no stable gain, typically landing between the best and worst individual ensemble members. This challenges the conventional multi-model ensemble paradigm and suggests that prompting-induced diversity within a single LLM is more effective than architectural diversity across models. The findings offer a lightweight, parameter-efficient approach to enhancing LLM robustness through prompt engineering rather than model scaling or ensemble expansion.
📝 Abstract
Effectively leveraging diversity has been shown to improve performance for various machine learning models, including large language models (LLMs). However, determining the most effective way to use diversity remains a challenge. In this work, we compare two diversity approaches for answering binary questions with LLMs: model diversity, which relies on multiple models answering the same question, and question interpretation diversity, which relies on the same model answering the same question framed in different ways. In both cases, we apply majority voting as the ensemble consensus heuristic to determine the final answer. Our experiments on BoolQ, StrategyQA, and PubMedQA show that question interpretation diversity consistently leads to better ensemble accuracy than model diversity. Furthermore, our analysis of GPT and LLaMA models shows that model diversity typically produces results between the best and the worst ensemble members, without a clear improvement.
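The two ensemble strategies compared in the abstract can be sketched with a simple majority-vote consensus. This is a minimal illustration, not the paper's implementation: the `model` callables and paraphrase lists below are hypothetical stand-ins for actual LLM calls returning binary answers.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer among binary ("yes"/"no") responses.

    Note: with an even number of voters a tie is possible; Counter breaks
    ties by first occurrence, whereas a real system might abstain instead.
    """
    return Counter(answers).most_common(1)[0][0]

def interpretation_ensemble(model, paraphrases):
    """Question interpretation diversity: one model answers several
    paraphrases of the same question; the majority answer wins."""
    return majority_vote([model(p) for p in paraphrases])

def model_ensemble(models, question):
    """Model diversity: several distinct models answer the same question;
    the majority answer wins."""
    return majority_vote([m(question) for m in models])
```

Using an odd number of paraphrases (or models) avoids ties and keeps the consensus rule unambiguous.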