Super Apriel: One Checkpoint, Many Speeds

📅 2026-04-21

🤖 AI Summary

This work presents Super Apriel, a 15B-parameter supernet that serves multiple speed presets from a single checkpoint, removing the need to deploy separate models for different latency requirements. Every decoder layer carries four trained mixers (FA, SWA, KDA, and GDN); a placement selects one mixer per layer and can be switched between requests at serving time without reloading weights. The shared checkpoint also supports speculative decoding without a separate draft model. The supernet is trained by stochastic distillation from a frozen Apriel 1.6 teacher followed by supervised fine-tuning, and a surrogate model guides the search over placements. The all-FA preset matches the teacher on all reported benchmarks. Recommended hybrid presets deliver 2.9× to 10.7× decode throughput at 96% to 77% quality retention, and the throughput advantage grows with context length.

📝 Abstract

We release Super Apriel, a 15B-parameter supernet in which every decoder layer provides four trained mixer choices -- Full Attention (FA), Sliding Window Attention (SWA), Kimi Delta Attention (KDA), and Gated DeltaNet (GDN). A placement selects one mixer per layer; placements can be switched between requests at serving time without reloading weights, enabling multiple speed presets from a single checkpoint. The shared checkpoint also enables speculative decoding without a separate draft model. The all-FA preset matches the Apriel 1.6 teacher on all reported benchmarks; recommended hybrid presets span $2.9\times$ to $10.7\times$ decode throughput at 96% to 77% quality retention, with throughput advantages that compound at longer context lengths. With four mixer types across 48 layers, the configuration space is vast. A surrogate that predicts placement quality from the per-layer mixer assignment makes the speed-quality landscape tractable and identifies the best tradeoffs at each speed level. We investigate whether the best configurations at each speed level can be identified early in training or only after convergence. Rankings stabilize quickly at 0.5B scale, but the most efficient configurations exhibit higher instability at 15B, cautioning against extrapolation from smaller models. Super Apriel is trained by stochastic distillation from a frozen Apriel 1.6 teacher, followed by supervised fine-tuning. We release the supernet weights, Fast-LLM training code, vLLM serving code, and a placement optimization toolkit.
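
To make the notion of a placement concrete, the following is a minimal sketch and not the released toolkit's API. It assumes a placement can be written as a sequence of 48 mixer names, one per layer; the helper names (`validate_placement`, `config_space_size`, `one_hot_features`) and the example hybrid pattern are hypothetical and chosen only for illustration. It shows why the configuration space is vast (4 choices at each of 48 layers) and one plausible way to turn a per-layer mixer assignment into input features for a surrogate.

```python
from collections import Counter

# Mixer names and layer count are taken from the abstract.
MIXERS = ("FA", "SWA", "KDA", "GDN")
NUM_LAYERS = 48


def config_space_size(num_layers=NUM_LAYERS, num_mixers=len(MIXERS)):
    """Number of distinct placements: one independent choice per layer."""
    return num_mixers ** num_layers


def validate_placement(placement):
    """Check that a placement names exactly one known mixer per layer."""
    if len(placement) != NUM_LAYERS:
        raise ValueError(f"expected {NUM_LAYERS} layers, got {len(placement)}")
    unknown = set(placement) - set(MIXERS)
    if unknown:
        raise ValueError(f"unknown mixers: {sorted(unknown)}")
    return tuple(placement)


def one_hot_features(placement):
    """Flatten a placement into a 48 * 4 binary vector (hypothetical encoding).

    A surrogate could consume a vector like this; the encoding used by the
    authors' surrogate is not specified in the abstract.
    """
    features = []
    for mixer in placement:
        features.extend(1 if mixer == name else 0 for name in MIXERS)
    return features


# The all-FA preset, which the abstract says matches the teacher.
all_fa = validate_placement(["FA"] * NUM_LAYERS)

# An illustrative hybrid (NOT one of the recommended presets):
# full attention on every fourth layer, Gated DeltaNet elsewhere.
hybrid = validate_placement(
    ["FA" if i % 4 == 0 else "GDN" for i in range(NUM_LAYERS)]
)

if __name__ == "__main__":
    print(f"placements: {config_space_size():.3e}")
    print("hybrid mixer counts:", dict(Counter(hybrid)))
    print("feature length:", len(one_hot_features(hybrid)))
```

With 4 mixers and 48 layers the count is 4^48 = 2^96, roughly 7.9 × 10^28 placements, which is why exhaustive evaluation is impractical and a surrogate is used to navigate the speed-quality landscape.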

Problem

Research questions and friction points this paper is trying to address.

supernet

mixture-of-attention

speed-quality tradeoff

dynamic inference

speculative decoding

Innovation

Methods, ideas, or system contributions that make the work stand out.

supernet

mixture-of-attention

dynamic inference