Weight Ensembling Improves Reasoning in Language Models

📅 2025-04-14
🤖 AI Summary
Reasoning models often suffer from diversity collapse during supervised fine-tuning (SFT): Pass@1 improves steadily while Pass@k degrades severely, limiting test-time scaling performance. Method: This paper first identifies the bias-variance trade-off underlying Pass@k degradation in reasoning tasks and proposes WiSE-FT, a weight interpolation method that linearly combines early and late SFT checkpoints to jointly optimize bias and variance, without additional training or data. Contribution/Results: WiSE-FT recovers and surpasses the original Pass@k performance while simultaneously improving Pass@1. It enhances test-time scaling strategies, including majority voting and Best@k, and achieves superior generalization with fewer labeled examples. Experimental results demonstrate consistent gains across diverse reasoning benchmarks, breaking the trade-off imposed by conventional decoding policies.
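The interpolation at the heart of WiSE-FT is a per-parameter linear average of two checkpoints. A minimal sketch, assuming checkpoints are represented as name-to-value mappings (the function name `wise_ft` and the dict-of-floats representation are illustrative; real checkpoints hold tensors):

```python
def wise_ft(early, late, alpha=0.5):
    """Interpolate two SFT checkpoints parameter-wise:
    theta = (1 - alpha) * theta_early + alpha * theta_late.
    alpha = 0 keeps the early checkpoint, alpha = 1 the late one.
    """
    assert early.keys() == late.keys(), "checkpoints must share parameter names"
    return {name: (1 - alpha) * early[name] + alpha * late[name]
            for name in early}
```

With tensors, the same expression applies entry-by-entry to a model's `state_dict`; no extra training or data is needed, matching the summary's claim.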

📝 Abstract
We investigate a failure mode that arises during the training of reasoning models, where the diversity of generations begins to collapse, leading to suboptimal test-time scaling. Notably, the Pass@1 rate reliably improves during supervised finetuning (SFT), but Pass@k rapidly deteriorates. Surprisingly, a simple intervention of interpolating the weights of the latest SFT checkpoint with an early checkpoint, otherwise known as WiSE-FT, almost completely recovers Pass@k while also improving Pass@1. The WiSE-FT variant achieves better test-time scaling (Best@k, majority vote) and achieves superior results with less data when tuned further by reinforcement learning. Finally, we find that WiSE-FT provides complementary performance gains that cannot be achieved only through diversity-inducing decoding strategies, like temperature scaling. We formalize a bias-variance tradeoff of Pass@k with respect to the expectation and variance of Pass@1 over the test distribution. We find that WiSE-FT can reduce bias and variance simultaneously, while temperature scaling inherently trades off between bias and variance.
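The Pass@k metric discussed in the abstract is typically computed with the standard unbiased combinatorial estimator from the code-generation evaluation literature; a sketch, assuming `n` samples are drawn per problem and `c` of them are correct:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased Pass@k estimate: probability that at least one of k
    samples, drawn without replacement from n, is correct."""
    if n - c < k:
        # fewer than k incorrect samples exist, so any k-subset
        # must contain a correct one
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Pass@1 is the special case k=1, which reduces to the fraction c/n of correct samples.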
Problem

Research questions and friction points this paper is trying to address.

Addresses diversity collapse in reasoning model training
Improves Pass@k and Pass@1 via weight interpolation
Reduces bias and variance in test-time scaling
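The bias-variance framing can be illustrated numerically. For a problem with per-problem Pass@1 probability p, the expected Pass@k is 1 - (1 - p)^k, which is concave in p for k > 1, so by Jensen's inequality spreading Pass@1 unevenly across problems while holding its mean fixed lowers the average Pass@k. A toy illustration (the per-problem probabilities are invented for the example):

```python
def expected_pass_at_k(per_problem_p, k):
    """Mean Pass@k over a test set, given each problem's Pass@1 probability."""
    return sum(1 - (1 - p) ** k for p in per_problem_p) / len(per_problem_p)

low_variance = [0.5, 0.5]    # mean Pass@1 = 0.5, zero variance
high_variance = [0.1, 0.9]   # same mean Pass@1 = 0.5, high variance
```

At k = 4 the low-variance model reaches an expected Pass@4 of about 0.94 versus about 0.67 for the high-variance one, despite identical Pass@1, which matches the claim that reducing variance (not just bias) matters for test-time scaling.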
Innovation

Methods, ideas, or system contributions that make the work stand out.

Weight interpolation recovers reasoning diversity
WiSE-FT improves test-time scaling significantly
Combines early and latest checkpoints effectively