xGen-small Technical Report

📅 2025-05-10
🤖 AI Summary
This work addresses the suboptimal performance of Transformer decoders on challenging reasoning tasks, such as mathematical problem solving and code generation, in long-context settings (up to 128K tokens). To this end, the authors introduce the xGen-small model family (4B and 9B parameters). The method combines a domain-balanced, frequency-aware data curation strategy; multi-stage pretraining with quality annealing; and a hierarchical post-training pipeline integrating supervised fine-tuning, DPO-based preference optimization, and online PPO reinforcement learning. Long-sequence modeling is further strengthened via optimized positional encoding and curriculum-driven training scheduling. Empirically, xGen-small achieves state-of-the-art results on long-context benchmarks (e.g., LooGLE, StreamingQA), significantly outperforming same-scale models, and attains state-of-the-art performance on core reasoning benchmarks, including GSM8K and HumanEval, demonstrating substantial gains in both mathematical reasoning and code generation.
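The summary mentions a DPO-based preference optimization stage in post-training. The report's exact configuration is not given here, but the standard DPO objective it refers to can be sketched in plain Python (the β value and the scalar log-probability inputs are illustrative assumptions):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Log-probability margin of the policy relative to a frozen reference:
    # positive when the policy prefers the chosen response more than the
    # reference model does.
    margin = beta * ((policy_chosen_logp - policy_rejected_logp)
                     - (ref_chosen_logp - ref_rejected_logp))
    # -log sigmoid(margin): shrinks toward 0 as the preferred response
    # becomes more likely under the policy.
    return math.log(1.0 + math.exp(-margin))
```

In practice these log-probabilities are summed over response tokens for batches of preference pairs; the scalar form above just isolates the shape of the objective.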

📝 Abstract
We introduce xGen-small, a family of 4B and 9B Transformer decoder models optimized for long-context applications. Our vertically integrated pipeline unites domain-balanced, frequency-aware data curation; multi-stage pre-training with quality annealing and length extension to 128k tokens; and targeted post-training via supervised fine-tuning, preference learning, and online reinforcement learning. xGen-small delivers strong performance across various tasks, especially in math and coding domains, while excelling at long context benchmarks.
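The abstract's length extension to 128k tokens is attributed to optimized positional encoding. One common mechanism (an assumption here, since the report's exact method is not stated in this summary) is raising the rotary-embedding base so the higher dimensions rotate more slowly, stretching the usable context; the base values below are illustrative:

```python
def rope_inv_freq(dim, base=10_000.0):
    """Per-dimension rotation frequencies used by rotary position embeddings."""
    return [base ** (-2 * i / dim) for i in range(dim // 2)]

# Raising the base damps rotation in all but the first dimension, a common
# way to extend context length before continued training on long sequences.
pretrain = rope_inv_freq(dim=128, base=10_000.0)
extended = rope_inv_freq(dim=128, base=500_000.0)
```

The first dimension always rotates one radian per position regardless of base; every later dimension rotates more slowly under the larger base, so long-range relative positions remain distinguishable.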
Problem

Research questions and friction points this paper is trying to address.

Optimizing Transformer models for long-context applications
Developing a multi-stage training pipeline for quality enhancement
Improving performance in math, coding, and long-context benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Domain-balanced, frequency-aware data curation
Multi-stage pre-training with quality annealing
Targeted post-training via supervised fine-tuning, preference learning, and online reinforcement learning
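The curation algorithm behind "domain-balanced, frequency-aware" is not detailed in this summary; a minimal sketch of one such scheme, temperature-based domain reweighting (the `alpha` value and domain labels are hypothetical), is:

```python
from collections import Counter

def sampling_weights(doc_domains, alpha=0.5):
    """Temperature-style reweighting: raise raw domain frequencies to
    alpha < 1 so rare domains are up-sampled and dominant ones damped."""
    counts = Counter(doc_domains)
    total = sum(counts.values())
    raw = {d: c / total for d, c in counts.items()}       # observed frequency
    damped = {d: p ** alpha for d, p in raw.items()}      # compress the gap
    norm = sum(damped.values())
    return {d: w / norm for d, w in damped.items()}       # renormalize

# A web-heavy corpus: balancing boosts the rare math/code domains.
weights = sampling_weights(["web"] * 90 + ["code"] * 9 + ["math"] * 1)
```

With `alpha=1` this reduces to sampling by raw frequency, and `alpha=0` gives uniform domain sampling; intermediate values trade off coverage of rare domains against fidelity to the natural distribution.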