xGen-small Technical Report

📅 2025-05-10
🤖 AI Summary
This work addresses the suboptimal performance of Transformer decoders on challenging reasoning tasks, such as mathematical problem solving and code generation, in long-context settings (up to 128K tokens). To this end, the authors introduce the xGen-small model family (4B and 9B parameters). The method combines a domain-balanced, frequency-aware data curation strategy; multi-stage pretraining with quality annealing; and a hierarchical post-training pipeline integrating supervised fine-tuning, DPO-based preference optimization, and online PPO reinforcement learning. Long-sequence modeling is further strengthened via optimized positional encoding and curriculum-driven training scheduling. Empirically, xGen-small achieves state-of-the-art results on long-context benchmarks (e.g., LooGLE, StreamingQA), significantly outperforming same-scale models, and attains state-of-the-art performance on core reasoning benchmarks, including GSM8K and HumanEval, demonstrating substantial gains in both mathematical reasoning and code generation.
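The summary mentions a DPO-based preference optimization stage in post-training. The report's exact configuration is not given here, but the standard DPO objective it refers to can be sketched in plain Python (the β value and the scalar log-probability inputs are illustrative assumptions):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Log-probability margin of the policy relative to a frozen reference:
    # positive when the policy prefers the chosen response more than the
    # reference model does.
    margin = beta * ((policy_chosen_logp - policy_rejected_logp)
                     - (ref_chosen_logp - ref_rejected_logp))
    # -log sigmoid(margin): shrinks toward 0 as the preferred response
    # becomes more likely under the policy.
    return math.log(1.0 + math.exp(-margin))
```

In practice these log-probabilities are summed over response tokens for batches of preference pairs; the scalar form above just isolates the shape of the objective.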

📝 Abstract
We introduce xGen-small, a family of 4B and 9B Transformer decoder models optimized for long-context applications. Our vertically integrated pipeline unites domain-balanced, frequency-aware data curation; multi-stage pre-training with quality annealing and length extension to 128k tokens; and targeted post-training via supervised fine-tuning, preference learning, and online reinforcement learning. xGen-small delivers strong performance across various tasks, especially in math and coding domains, while excelling at long context benchmarks.
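The abstract's length extension to 128k tokens is attributed to optimized positional encoding. One common mechanism (an assumption here, since the report's exact method is not stated in this summary) is raising the rotary-embedding base so the higher dimensions rotate more slowly, stretching the usable context; the base values below are illustrative:

```python
def rope_inv_freq(dim, base=10_000.0):
    """Per-dimension rotation frequencies used by rotary position embeddings."""
    return [base ** (-2 * i / dim) for i in range(dim // 2)]

# Raising the base damps rotation in all but the first dimension, a common
# way to extend context length before continued training on long sequences.
pretrain = rope_inv_freq(dim=128, base=10_000.0)
extended = rope_inv_freq(dim=128, base=500_000.0)
```

The first dimension always rotates one radian per position regardless of base; every later dimension rotates more slowly under the larger base, so long-range relative positions remain distinguishable.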
Problem

Research questions and friction points this paper is trying to address.

Optimizing Transformer models for long-context applications
Developing a multi-stage training pipeline for quality enhancement
Improving performance in math, coding, and long-context benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Domain-balanced, frequency-aware data curation
Multi-stage pre-training with quality annealing
Targeted post-training via supervised fine-tuning, preference learning, and online reinforcement learning
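The curation algorithm behind "domain-balanced, frequency-aware" is not detailed in this summary; a minimal sketch of one such scheme, temperature-based domain reweighting (the `alpha` value and domain labels are hypothetical), is:

```python
from collections import Counter

def sampling_weights(doc_domains, alpha=0.5):
    """Temperature-style reweighting: raise raw domain frequencies to
    alpha < 1 so rare domains are up-sampled and dominant ones damped."""
    counts = Counter(doc_domains)
    total = sum(counts.values())
    raw = {d: c / total for d, c in counts.items()}       # observed frequency
    damped = {d: p ** alpha for d, p in raw.items()}      # compress the gap
    norm = sum(damped.values())
    return {d: w / norm for d, w in damped.items()}       # renormalize

# A web-heavy corpus: balancing boosts the rare math/code domains.
weights = sampling_weights(["web"] * 90 + ["code"] * 9 + ["math"] * 1)
```

With `alpha=1` this reduces to sampling by raw frequency, and `alpha=0` gives uniform domain sampling; intermediate values trade off coverage of rare domains against fidelity to the natural distribution.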