Dipper: Diversity in Prompts for Producing Large Language Model Ensembles in Reasoning tasks

📅 2024-12-12
🏛️ arXiv.org
📈 Citations: 11
Influential: 1
🤖 AI Summary
Small-scale large language models (LLMs) exhibit limited performance on complex reasoning tasks such as mathematical problem solving. To address this, we propose a training-free, inference-time ensemble framework: multiple semantically and structurally diverse prompts are fed in parallel to elicit heterogeneous reasoning paths from a single lightweight LLM (e.g., Qwen2-MATH-1.5B-it), followed by majority voting and self-consistency-based fusion for robust aggregation. This work introduces the first "prompt-diversity-driven" training-free LLM ensemble paradigm, circumventing the conventional reliance on parameter diversity across multiple models. Empirically, our three-prompt parallel ensemble achieves a 6.2% absolute accuracy gain over the larger Qwen2-MATH-7B-it on the MATH benchmark, demonstrating superior performance despite a lower parameter count, while maintaining controllable inference latency.

📝 Abstract
Large Language Models still encounter substantial challenges in reasoning tasks, especially for smaller models, which many users may be restricted to due to resource constraints (e.g., GPU memory restrictions). Inference-time methods to boost LLM performance, such as prompting methods to invoke certain reasoning pathways in responses, have been shown effective in past works, though they largely rely on sequential queries. The ensemble method, which consists of multiple constituent models running in parallel, is a promising approach to achieving better inference-time performance, especially given recent developments that enabled significant speed-ups in LLM batch inference. In this work, we propose a novel, training-free LLM ensemble framework where a single LLM is fed an optimized, diverse set of prompts in parallel, effectively producing an ensemble at inference time to achieve performance improvement in reasoning tasks. We empirically demonstrate that our method leads to significant gains on math reasoning tasks, e.g., on MATH, where our ensemble consisting of a few small models (e.g., three Qwen2-MATH-1.5B-it models) can outperform a larger model (e.g., Qwen2-MATH-7B-it).
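The pipeline in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: `query_model` is a hypothetical stand-in for a real batched inference call to a single small LLM (in the paper, e.g., Qwen2-MATH-1.5B-it), and the prompt templates and canned answers are invented for the example.

```python
from collections import Counter

# Semantically diverse prompt templates for the SAME underlying model.
# These templates are illustrative, not the paper's optimized set.
DIVERSE_PROMPTS = [
    "Solve step by step: {question}",
    "Solve a simpler version first, then the original: {question}",
    "List the key facts, then compute the answer: {question}",
]

# Pretend model outputs for one question, so the sketch is self-contained.
_CANNED = {
    DIVERSE_PROMPTS[0]: "42",
    DIVERSE_PROMPTS[1]: "42",
    DIVERSE_PROMPTS[2]: "41",
}

def query_model(template: str, question: str) -> str:
    """Placeholder for one constituent's generation; in practice all
    constituents would be sent to the LLM as a single parallel batch."""
    _prompt = template.format(question=question)
    return _CANNED[template]

def ensemble_answer(question: str) -> str:
    """Query the single model once per diverse prompt, then take a
    majority vote over the final answers (self-consistency style)."""
    answers = [query_model(t, question) for t in DIVERSE_PROMPTS]
    return Counter(answers).most_common(1)[0][0]

print(ensemble_answer("What is 6 * 7?"))  # -> 42
```

The vote is over final answers only, so constituents may reach the same answer via different reasoning paths; that heterogeneity is exactly what the diverse prompts are meant to induce.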
Problem

Research questions and friction points this paper is trying to address.

Improving reasoning performance of small LLMs
Generating diverse prompts for parallel ensemble inference
Enhancing accuracy without additional model training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diverse parallel prompts create ensemble reasoning paths
Training-free framework transforms single LLM into ensemble
Optimized prompt sets elicit varied reasoning for performance gains
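One way to make "optimized prompt sets" concrete is to score candidate subsets by how much their answers disagree on a small validation set, preferring sets that induce varied reasoning. This is an illustrative heuristic under invented toy data, not the paper's actual optimization procedure.

```python
from itertools import combinations

# Candidate prompt styles (hypothetical names) and their answers on a
# tiny validation set of three questions; all values here are toy data.
ANSWERS = {
    "cot":     ["4", "9", "12"],
    "plan":    ["4", "8", "12"],
    "analogy": ["5", "9", "12"],
    "direct":  ["4", "9", "11"],
}

def disagreement(subset) -> float:
    """Fraction of (prompt pair, question) combinations where the two
    prompts produced different answers; higher means more diversity."""
    pairs = list(combinations(subset, 2))
    n_q = len(next(iter(ANSWERS.values())))
    diffs = sum(ANSWERS[a][i] != ANSWERS[b][i]
                for a, b in pairs for i in range(n_q))
    return diffs / (len(pairs) * n_q)

# Pick the 3-prompt subset with the most answer diversity.
best = max(combinations(ANSWERS, 3), key=disagreement)
print(best)  # -> ('plan', 'analogy', 'direct')
```

A real selection criterion would also weigh individual accuracy, since maximally disagreeing prompts are useless if most constituents are wrong; diversity only helps majority voting when errors are uncorrelated.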