From Drafts to Answers: Unlocking LLM Potential via Aggregation Fine-Tuning

📅 2025-01-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the prevailing reliance on scaling data and model parameters for improving large language model (LLM) performance, this paper proposes Aggregation Fine-Tuning (AFT). AFT uses supervised fine-tuning to teach the model to aggregate multiple draft responses end-to-end into a single refined answer; at inference time, it adopts an iterative "propose-and-aggregate" strategy, tightly coupling test-time compute scaling with the training paradigm. Crucially, AFT requires no increase in model size or training-data volume: fine-tuned from Llama3.1-8B on only 64K samples, it achieves a 41.3% LC win rate on AlpacaEval 2, substantially outperforming both Llama3.1-405B-Instruct and GPT-4. The core contributions are threefold: (i) introducing the first supervised aggregation fine-tuning paradigm; (ii) enabling controllable, synthesis-based draft-to-answer transformation; and (iii) establishing an efficient, scalable, lightweight inference pathway.

📝 Abstract
Scaling data and model size has been proven effective for boosting the performance of large language models. In addition to training-time scaling, recent studies have revealed that increasing test-time computational resources can further improve performance. In this work, we introduce Aggregation Fine-Tuning (AFT), a supervised fine-tuning paradigm where the model learns to synthesize multiple draft responses, referred to as proposals, into a single, refined answer, termed aggregation. At inference time, a propose-and-aggregate strategy further boosts performance by iteratively generating proposals and aggregating them. Empirical evaluations on benchmark datasets show that AFT-trained models substantially outperform standard SFT. Notably, an AFT model, fine-tuned from Llama3.1-8B-Base with only 64k data, achieves a 41.3% LC win rate on AlpacaEval 2, surpassing significantly larger LLMs such as Llama3.1-405B-Instruct and GPT-4. By combining sequential refinement and parallel sampling, the propose-and-aggregate framework scales inference-time computation in a flexible manner. Overall, these findings position AFT as a promising approach to unlocking additional capabilities of LLMs without resorting to increasing data volume or model size.
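The propose-and-aggregate loop described above can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the prompt wording, the `generate` callable, and the choice to carry the previous aggregation forward as a proposal are all assumptions for the sake of a runnable example.

```python
from typing import Callable, List

def propose_and_aggregate(
    generate: Callable[[str], str],
    question: str,
    n_proposals: int = 3,
    n_rounds: int = 2,
) -> str:
    """Sample draft answers (proposals) in parallel, then ask the model to
    synthesize them into one refined answer (aggregation), repeating for
    several rounds to scale inference-time compute."""
    # Parallel sampling: draw independent drafts for the same question.
    proposals: List[str] = [generate(question) for _ in range(n_proposals)]
    answer = ""
    for _ in range(n_rounds):
        drafts = "\n\n".join(
            f"Draft {i + 1}: {p}" for i, p in enumerate(proposals)
        )
        # Hypothetical aggregation prompt; the paper's actual template may differ.
        prompt = (
            f"Question: {question}\n\n{drafts}\n\n"
            "Synthesize the drafts above into a single, refined answer."
        )
        answer = generate(prompt)
        # Sequential refinement: feed the aggregated answer back in as one of
        # the proposals for the next round, alongside fresh samples.
        proposals = [answer] + [
            generate(question) for _ in range(n_proposals - 1)
        ]
    return answer
```

Increasing `n_proposals` widens the parallel-sampling axis while `n_rounds` deepens the sequential-refinement axis, which is the flexible compute scaling the abstract refers to.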
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Performance Enhancement
Intelligence Improvement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Aggregation Fine-Tuning
Large Language Models
Quality Enhancement