AI Summary
Traditional Best-of-N decoding selects only the single highest-scoring candidate, discarding complementary information present in other candidates. This work proposes Fusion-of-N (FusioN), the first method to employ a large language model (LLM) as a judge that synthesizes salient information across N generated candidates into a single high-quality output, shifting the paradigm from selective filtering to collaborative synthesis. FusioN integrates test-time multiple sampling, multi-teacher synthetic data distillation, and cross-lingual multi-task evaluation. It consistently outperforms Best-of-N across 11 languages, three task categories (e.g., reasoning, translation, summarization), and multiple model scales, yielding significant improvements in both generation quality and downstream task performance. Empirical results demonstrate that FusioN effectively harnesses the multifaceted value of LLM outputs, validating the robustness and generalizability of information fusion over candidate selection.
Abstract
Obtaining high-quality generations in modern LLMs has largely been framed as a selection problem: identifying a single winning generation from a diverse pool of N samples, the Best-of-N (BoN). Yet this approach is inherently zero-sum, discarding diverse and potentially useful information from the pool. Instead, we explore a collaborative setup, where all candidates can potentially contribute to the final winning generation. To this end, we propose Fusion-of-N (FusioN): a method that uses a general LLM judge to synthesize the most informative elements of each sample into a single final answer. We compare FusioN to BoN in two settings: (i) test-time scaling, where we sample and aggregate from a single model at test time, and (ii) synthetic data generation, where we fuse samples from a pool of diverse teachers to improve a student model. We extensively benchmark both setups across 11 languages, 3 diverse tasks, and varying model scales. Across the bench, FusioN consistently outperforms BoN, showing versatility and robustness both in test-time scaling and in downstream gains from synthetic data generation. We also perform extensive analysis on FusioN, where it shows surprising strengths and robustness under challenging settings. These results show that we should shift how we think about evaluating and utilizing LLM generations, from a monolithic measure of quality to embracing their polylithic nature. This shift allows us to integrate diverse strengths, unlock latent potential, and achieve improvements that were previously inaccessible through selection alone.
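The contrast between selection and fusion can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `score` and `judge` callables, and the wording of the fusion prompt are all assumptions made for illustration, and the abstract does not specify the actual prompt or scoring model.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Selection (BoN): keep the single highest-scoring candidate, drop the rest."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)


def build_fusion_prompt(task: str, candidates: Sequence[str]) -> str:
    """Assemble a hypothetical fusion instruction (the wording is an assumption)."""
    numbered = "\n\n".join(
        f"[Candidate {i}]\n{text}" for i, text in enumerate(candidates, start=1)
    )
    return (
        "Combine the most informative parts of the candidate answers below "
        "into one final answer.\n\n"
        f"[Task]\n{task}\n\n{numbered}\n\n[Final answer]\n"
    )


def fusion_of_n(
    task: str, candidates: Sequence[str], judge: Callable[[str], str]
) -> str:
    """Fusion: show the judge every candidate and return its synthesized answer.

    `judge` stands in for any LLM call that maps a prompt string to a completion.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    return judge(build_fusion_prompt(task, candidates))
```

In the test-time scaling setting, `candidates` would be N samples from a single model; in the synthetic data setting, they would come from a pool of different teachers. In both cases the structural difference from BoN is that the judge sees every candidate rather than returning one of them unchanged.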