🤖 AI Summary
To address the prevailing reliance on scaling data and model parameters for improving large language model (LLM) performance, this paper proposes Aggregation Fine-Tuning (AFT). AFT uses supervised fine-tuning to teach a model to aggregate multiple draft responses end-to-end into a single refined final answer; at inference time, it adopts an iterative "propose-and-aggregate" strategy, tightly coupling test-time compute scaling with the training paradigm. Crucially, AFT requires no increase in model size or training data volume: only 64k samples are used to fine-tune Llama3.1-8B, yet the resulting model achieves a 41.3% LC win rate on AlpacaEval 2, substantially outperforming both Llama3.1-405B-Instruct and GPT-4. The core contributions are threefold: (i) introducing the first supervised aggregation fine-tuning paradigm; (ii) enabling controllable, synthesis-based draft-to-answer transformation; and (iii) establishing an efficient, scalable, lightweight inference pathway.
📝 Abstract
Scaling data and model size has been proven effective for boosting the performance of large language models. In addition to training-time scaling, recent studies have revealed that increasing test-time computational resources can further improve performance. In this work, we introduce Aggregation Fine-Tuning (AFT), a supervised fine-tuning paradigm in which the model learns to synthesize multiple draft responses, referred to as proposals, into a single, refined answer, termed aggregation. At inference time, a propose-and-aggregate strategy further boosts performance by iteratively generating proposals and aggregating them. Empirical evaluations on benchmark datasets show that AFT-trained models substantially outperform standard SFT. Notably, an AFT model, fine-tuned from Llama3.1-8B-Base with only 64k examples, achieves a 41.3% LC win rate on AlpacaEval 2, surpassing significantly larger LLMs such as Llama3.1-405B-Instruct and GPT-4. By combining sequential refinement and parallel sampling, the propose-and-aggregate framework scales inference-time computation in a flexible manner. Overall, these findings position AFT as a promising approach to unlocking additional capabilities of LLMs without resorting to increasing data volume or model size.
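The propose-and-aggregate loop described above combines parallel sampling (drawing several proposals) with sequential refinement (feeding the current aggregate back into the next round). A minimal sketch of that control flow, assuming a generic `generate` callable standing in for an AFT-trained model (the function name, prompt format, and default parameters here are illustrative placeholders, not the paper's implementation):

```python
from typing import Callable, List


def _stub_generate(prompt: str) -> str:
    """Placeholder for an AFT-trained LLM call; returns a canned response."""
    return f"answer({len(prompt)} chars of context)"


def propose_and_aggregate(
    question: str,
    generate: Callable[[str], str] = _stub_generate,
    n_proposals: int = 4,
    n_rounds: int = 2,
) -> str:
    """Iteratively sample draft proposals, then aggregate them into one answer."""
    answer = None
    for _ in range(n_rounds):
        # Parallel sampling: draw several independent draft responses.
        proposals: List[str] = [
            generate(f"{question}\n[draft {i}]") for i in range(n_proposals)
        ]
        # Sequential refinement: carry the previous round's aggregate
        # forward as one more proposal, so each round refines the last.
        if answer is not None:
            proposals.append(answer)
        # Aggregation step: the model synthesizes all drafts into a
        # single refined answer (the format here is hypothetical).
        agg_prompt = f"{question}\nProposals:\n" + "\n".join(proposals)
        answer = generate(agg_prompt)
    return answer
```

Because `n_proposals` and `n_rounds` are independent knobs, this structure makes the flexible test-time compute scaling concrete: widen the parallel dimension, deepen the sequential one, or both.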