🤖 AI Summary
This work addresses the suboptimal performance of Transformer decoders on challenging reasoning tasks, such as mathematical problem solving and code generation, under long-context settings (up to 128K tokens). To this end, we introduce the xGen-small model family (4B and 9B parameters). Our method combines a domain-balanced, frequency-aware data curation strategy; multi-stage pre-training with quality annealing; and a hierarchical post-training paradigm integrating supervised fine-tuning, DPO-based preference optimization, and online PPO reinforcement learning. We further enhance long-sequence modeling via optimized positional encoding and curriculum-driven training schedules. Empirically, xGen-small achieves state-of-the-art results on long-context benchmarks (e.g., LooGLE, StreamingQA), significantly outperforming models of comparable scale. It also attains SOTA performance on core reasoning benchmarks, including GSM8K and HumanEval, demonstrating substantial gains in both mathematical reasoning and code generation.
📝 Abstract
We introduce xGen-small, a family of 4B and 9B Transformer decoder models optimized for long-context applications. Our vertically integrated pipeline unites domain-balanced, frequency-aware data curation; multi-stage pre-training with quality annealing and length extension to 128K tokens; and targeted post-training via supervised fine-tuning, preference learning, and online reinforcement learning. xGen-small delivers strong performance across diverse tasks, especially in the math and coding domains, while excelling on long-context benchmarks.