Rethinking LLM Ensembling from the Perspective of Mixture Models

📅 2026-05-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high computational cost of conventional large language model (LLM) ensembles, which require a separate forward pass through every constituent model to compute the aggregated output distribution. The authors propose the Mixture-model-like Ensemble (ME), which reinterprets the ensemble as a mixture model: at each decoding step a single model is stochastically selected to generate the next token, so the full ensemble distribution is never computed explicitly. ME is mathematically equivalent to sampling from the ensemble distribution while invoking only one model, and is reported to be 1.78x-2.68x faster than conventional ensembling. The mixture-model view also connects LLM ensembling to token-level routing, suggesting that LLM ensembling is a special case of routing methods.
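The equivalence claim can be written out as a one-line marginalization. The notation below is an assumption made for illustration and may differ from the paper's: $K$ models, next-token distributions $p_k(\cdot \mid x_{<t})$, mixture weights $w_k \ge 0$ with $\sum_k w_k = 1$ (plain averaging corresponds to $w_k = 1/K$), and a latent model index $z$.

```latex
% Conventional ensemble: weighted average of the per-model distributions
p_{\mathrm{ens}}(v \mid x_{<t}) = \sum_{k=1}^{K} w_k \, p_k(v \mid x_{<t})

% Mixture-style sampling: draw a model index, then a token from that model
z \sim \mathrm{Categorical}(w_1, \dots, w_K), \qquad x_t \sim p_z(\cdot \mid x_{<t})

% Marginalizing over z recovers the ensemble distribution
\Pr(x_t = v \mid x_{<t})
  = \sum_{k=1}^{K} \Pr(z = k) \, \Pr(x_t = v \mid z = k, x_{<t})
  = \sum_{k=1}^{K} w_k \, p_k(v \mid x_{<t})
  = p_{\mathrm{ens}}(v \mid x_{<t})
```

Under these assumptions only $p_z$ has to be evaluated at a given step, which is consistent with the summary's statement that one model is invoked per token.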

📝 Abstract

Model ensembling is a well-established technique for improving the performance of machine learning models. Conventionally, this involves averaging the output distributions of multiple models and selecting the most probable label. This idea has been naturally extended to large language models (LLMs), yielding improved performance but incurring substantial computational cost. This inefficiency stems from directly applying the conventional ensemble implementation to LLMs, which requires a separate forward pass for each model to explicitly compute the ensemble distribution. In this paper, we propose the Mixture-model-like Ensemble (ME). By reinterpreting the ensemble as a mixture model, ME stochastically selects a single model at each step to generate the next token, thereby avoiding the need to explicitly compute the full ensemble distribution. ME is mathematically equivalent to sampling from the ensemble distribution, but requires invoking only one model, making it 1.78x-2.68x faster than conventional ensembling. Furthermore, this perspective connects LLM ensembling and token-level routing methods, suggesting that LLM ensembling is a special case of routing methods. Our findings open new avenues for efficient LLM ensembling and motivate further exploration of token-level routing strategies for LLMs. Our code is available at https://github.com/jialefu/Mixture-model-like-Ensemble/.
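A minimal toy sketch of the idea described in the abstract, not the authors' implementation: two hypothetical "models" are represented by fixed next-token probability tables over a 3-token vocabulary, and all function names (`ensemble_distribution`, `sample_conventional`, `sample_mixture`, `empirical`) are invented for this illustration. In a real LLM setting the point of the mixture-style sampler is that only the selected model would be run; here the tables are precomputed, so the sketch only illustrates that both samplers target the same distribution.

```python
import random


def ensemble_distribution(dists, weights):
    """Weighted average of the per-model next-token distributions."""
    vocab_size = len(dists[0])
    return [
        sum(w * d[t] for w, d in zip(weights, dists))
        for t in range(vocab_size)
    ]


def sample_conventional(dists, weights, rng):
    """Conventional ensembling: combine every model's distribution, then sample."""
    mixed = ensemble_distribution(dists, weights)
    return rng.choices(range(len(mixed)), weights=mixed, k=1)[0]


def sample_mixture(dists, weights, rng):
    """Mixture-style sampling: pick one model, then sample from that model only."""
    k = rng.choices(range(len(dists)), weights=weights, k=1)[0]
    return rng.choices(range(len(dists[k])), weights=dists[k], k=1)[0]


def empirical(sampler, dists, weights, n, seed):
    """Empirical token frequencies of a sampler over n independent draws."""
    rng = random.Random(seed)
    counts = [0] * len(dists[0])
    for _ in range(n):
        counts[sampler(dists, weights, rng)] += 1
    return [c / n for c in counts]


# Toy next-token distributions for two hypothetical models (assumed values).
TOY_DISTS = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
TOY_WEIGHTS = [0.5, 0.5]

if __name__ == "__main__":
    print("target      ", ensemble_distribution(TOY_DISTS, TOY_WEIGHTS))
    print("conventional", empirical(sample_conventional, TOY_DISTS, TOY_WEIGHTS, 20000, 0))
    print("mixture     ", empirical(sample_mixture, TOY_DISTS, TOY_WEIGHTS, 20000, 1))
```

With enough draws, both empirical frequency vectors should approach the weighted-average target, which is the sampling equivalence the abstract states.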

Problem

Research questions and friction points this paper is trying to address.

LLM ensembling

computational efficiency

mixture models

token-level routing

Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture Model

LLM Ensembling

Token-level Routing

Efficient Inference