ZAYA1-8B Technical Report

📅 2026-05-06

🤖 AI Summary
This work presents ZAYA1-8B, a sparse mixture-of-experts (MoE) model with 8B total parameters and fewer than 1B (about 700M) active parameters, designed to narrow the gap between small and large models in mathematical and code reasoning. Reasoning data is included from pretraining onward using an answer-preserving trimming scheme. The model builds on Zyphra's MoE++ architecture and is post-trained with a four-stage reinforcement learning cascade. The paper also introduces Markovian RSA, a test-time compute method that recursively aggregates parallel reasoning traces while carrying forward only a bounded-length reasoning tail (4K tokens) between rounds. Core pretraining, midtraining, and supervised fine-tuning were performed on a full-stack AMD platform. With under 1B active parameters, ZAYA1-8B matches or exceeds DeepSeek-R1-0528 on several mathematics and coding benchmarks. With Markovian RSA it reaches 91.9% on AIME'25 and 89.6% on HMMT'25, narrowing the gap to much larger models such as Gemini-2.5 Pro.
📝 Abstract
We present ZAYA1-8B, a reasoning-focused mixture-of-experts (MoE) model with 700M active and 8B total parameters, built on Zyphra's MoE++ architecture. ZAYA1-8B's core pretraining, midtraining, and supervised fine-tuning (SFT) were performed on a full-stack AMD compute, networking, and software platform. With under 1B active parameters, ZAYA1-8B matches or exceeds DeepSeek-R1-0528 on several challenging mathematics and coding benchmarks, and remains competitive with substantially larger open-weight reasoning models. ZAYA1-8B was trained from scratch for reasoning, with reasoning data included from pretraining onward using an answer-preserving trimming scheme. Post-training uses a four-stage RL cascade: reasoning warmup on math and puzzles; a 400-task RLVE-Gym curriculum; math and code RL with test-time compute traces and synthetic code environments built from competitive-programming references; and behavioral RL for chat and instruction following. We also introduce Markovian RSA, a test-time compute method that recursively aggregates parallel reasoning traces while carrying forward only bounded-length reasoning tails between rounds. In test-time compute (TTC) evaluation, Markovian RSA raises ZAYA1-8B to 91.9% on AIME'25 and 89.6% on HMMT'25 while carrying forward only a 4K-token tail, narrowing the gap to much larger reasoning models including Gemini-2.5 Pro, DeepSeek-V3.2, and GPT-5-High.
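The abstract describes Markovian RSA only at a high level, so the following is a minimal sketch of the general idea rather than the authors' implementation. It assumes a fixed population of parallel traces, random subsets of prior tails used as aggregation context, and a fixed number of rounds. The names `markovian_aggregate`, `toy_generate`, `population`, `subset_size`, and `rounds` are hypothetical, and `toy_generate` is a stand-in for a real model call.

```python
import random
from typing import Callable, List

Trace = List[str]  # a reasoning trace as a list of tokens


def tail(trace: Trace, max_tokens: int) -> Trace:
    """Keep only the last max_tokens tokens of a trace."""
    return trace[-max_tokens:] if max_tokens > 0 else []


def markovian_aggregate(
    prompt: str,
    generate: Callable[[str, List[Trace]], Trace],
    population: int = 4,      # assumed: number of parallel traces per round
    subset_size: int = 2,     # assumed: tails shown to each aggregation call
    rounds: int = 3,          # assumed: number of aggregation rounds
    tail_tokens: int = 4096,  # bounded tail carried between rounds
    seed: int = 0,
) -> List[Trace]:
    rng = random.Random(seed)
    # Round 0: independent parallel traces with no carried context.
    tails = [tail(generate(prompt, []), tail_tokens) for _ in range(population)]
    for _ in range(rounds):
        new_tails = []
        for _ in range(population):
            # Each new trace sees only bounded tails from the previous round,
            # never the full history, so the context size stays constant.
            context = rng.sample(tails, subset_size)
            new_tails.append(tail(generate(prompt, context), tail_tokens))
        tails = new_tails
    return tails


def toy_generate(prompt: str, context: List[Trace]) -> Trace:
    """Hypothetical stand-in for a model: echoes context, then adds 6 tokens."""
    carried = [tok for t in context for tok in t]
    return carried + [f"step{i}" for i in range(6)]


final_tails = markovian_aggregate("q", toy_generate, tail_tokens=8)
print([len(t) for t in final_tails])
```

Because each round depends only on the truncated tails of the previous round, the per-call context in this sketch is bounded by `subset_size * tail_tokens` no matter how many rounds are run.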
Problem

Research questions and friction points this paper is trying to address.

reasoning
mixture-of-experts
mathematics
code generation
efficient inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts
reasoning-focused training
Markovian RSA
reinforcement learning cascade
answer-preserving trimming