InfiAlign: A Scalable and Sample-Efficient Framework for Aligning LLMs to Enhance Reasoning Capabilities

📅 2025-08-07
🤖 AI Summary
To address the high data and computational costs associated with enhancing large language models’ (LLMs) reasoning capabilities during post-training, this paper proposes InfiAlign—a scalable, sample-efficient alignment framework. Methodologically, it introduces (1) an automated data curation pipeline guided by multi-dimensional quality assessment, overcoming the cross-task and cross-source scalability limitations of heuristic-based filtering; and (2) an end-to-end reasoning alignment strategy that jointly integrates supervised fine-tuning (SFT) and direct preference optimization (DPO). Evaluated on Qwen2.5-Math-7B-Base, InfiAlign achieves performance comparable to DeepSeek-R1-Distill-Qwen-7B—despite using only 12% of publicly available reasoning data—yielding an average 3.89% improvement on AIME 2024/2025 benchmarks. The framework significantly reduces data dependency while preserving strong generalization across diverse reasoning tasks.
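The curation pipeline described above scores candidate training samples along multiple quality dimensions and keeps only those that clear a threshold. A minimal sketch of that idea in Python; the individual scoring functions, weights, and threshold below are illustrative stand-ins, not the paper's actual metrics:

```python
from dataclasses import dataclass

@dataclass
class Sample:
    prompt: str
    response: str

# Hypothetical per-dimension scorers; the paper's real metrics are not specified here.
def diversity_score(s: Sample) -> float:
    # Proxy: unique-token ratio of the response.
    tokens = s.response.split()
    return len(set(tokens)) / max(len(tokens), 1)

def difficulty_score(s: Sample) -> float:
    # Proxy: longer reasoning chains treated as harder (capped at 1.0).
    return min(len(s.response.split()) / 512, 1.0)

def quality_score(s: Sample) -> float:
    # Proxy: penalize responses that look truncated.
    return 1.0 if s.response.strip().endswith((".", "?", "!")) else 0.5

def select(samples, weights=(0.4, 0.3, 0.3), threshold=0.6):
    """Keep samples whose weighted multidimensional score clears the threshold."""
    kept = []
    for s in samples:
        score = (weights[0] * diversity_score(s)
                 + weights[1] * difficulty_score(s)
                 + weights[2] * quality_score(s))
        if score >= threshold:
            kept.append(s)
    return kept
```

Because the scorers are plain functions, the pipeline stays extensible: adding a new data source or quality dimension means adding one scorer and a weight, which matches the scalability claim in the summary.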

📝 Abstract
Large language models (LLMs) have exhibited impressive reasoning abilities on a wide range of complex tasks. However, enhancing these capabilities through post-training remains resource-intensive, particularly in terms of data and computational cost. Although recent efforts have sought to improve sample efficiency through selective data curation, existing methods often rely on heuristic or task-specific strategies that hinder scalability. In this work, we introduce InfiAlign, a scalable and sample-efficient post-training framework that integrates supervised fine-tuning (SFT) with Direct Preference Optimization (DPO) to align LLMs for enhanced reasoning. At the core of InfiAlign is a robust data selection pipeline that automatically curates high-quality alignment data from open-source reasoning datasets using multidimensional quality metrics. This pipeline enables significant performance gains while drastically reducing data requirements and remains extensible to new data sources. When applied to the Qwen2.5-Math-7B-Base model, our SFT model achieves performance on par with DeepSeek-R1-Distill-Qwen-7B, while using only approximately 12% of the training data, and demonstrates strong generalization across diverse reasoning tasks. Additional improvements are obtained through the application of DPO, with particularly notable gains in mathematical reasoning tasks. The model achieves an average improvement of 3.89% on AIME 2024/2025 benchmarks. Our results highlight the effectiveness of combining principled data selection with full-stage post-training, offering a practical solution for aligning large reasoning models in a scalable and data-efficient manner. The model checkpoints are available at https://huggingface.co/InfiX-ai/InfiAlign-Qwen-7B-SFT.
Problem

Research questions and friction points this paper is trying to address.

Enhancing LLM reasoning with scalable post-training alignment
Reducing data and computational costs in LLM alignment
Improving generalization across diverse reasoning tasks efficiently
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines SFT and DPO for efficient LLM alignment
Automates high-quality data selection using multidimensional metrics
Reduces data needs while improving reasoning performance
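The DPO stage that follows SFT optimizes a preference loss over chosen/rejected response pairs. A minimal sketch of the standard DPO objective for a single pair; the function names and the beta value are illustrative defaults, not the paper's reported hyperparameters:

```python
import math

def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    Inputs are summed token log-probabilities of the chosen and rejected
    responses under the policy (pi_*) and the frozen reference model (ref_*).
    """
    logits = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # Numerically stable -log(sigmoid(logits)).
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

The loss falls below log(2) whenever the policy favors the chosen response more strongly than the reference does, which is the signal DPO uses to refine the SFT model without training a separate reward model.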