Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding

📅 2025-07-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the low decoding efficiency and suboptimal hardware utilization of large language models (LLMs) in long-context reasoning, this paper proposes Step-3, a hardware-aware model-system co-design. Methodologically, the authors: (i) introduce Multi-Matrix Factorization Attention (MFA), a novel attention mechanism that substantially compresses the KV cache and reduces attention computation; (ii) propose Attention-FFN Disaggregation (AFD), which decouples attention and FFN layers into specialized, independently scalable subsystems; and (iii) combine MoE-based sparse activation with FP8 quantization. Evaluated on Hopper GPUs, Step-3 achieves a decoding throughput of 4,039 tokens per second per GPU with under-50 ms time-per-output-token (TPOT) at 4K context length, outperforming DeepSeek-V3 and establishing a new Pareto frontier for LLM decoding efficiency.
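The KV-cache compression idea behind MFA can be illustrated with a minimal decoding-step sketch. This is an assumption-laden illustration, not the paper's implementation: it treats MFA as multi-query-style attention whose per-head query projection is factorized through a shared low-rank bottleneck (`W_q_down`, `W_q_up` are hypothetical names), so the cache stores only one small K/V pair per token instead of one per head.

```python
import numpy as np

def mfa_attention_step(x, W_q_down, W_q_up, W_k, W_v, kv_cache):
    """One decoding step of a simplified MFA-style attention.

    Illustrative only: queries pass through a low-rank bottleneck
    (W_q_down then per-head W_q_up), while all heads share a single
    K/V head, so the cache grows by just one (k, v) pair per token.
    """
    n_heads, d_head = W_q_up.shape[0], W_q_up.shape[2]
    q_low = x @ W_q_down                                       # (r,)
    q = np.stack([q_low @ W_q_up[h] for h in range(n_heads)])  # (n_heads, d_head)
    # Single shared K/V head: only these small vectors are cached.
    kv_cache.append((x @ W_k, x @ W_v))                        # each (d_head,)
    K = np.stack([k for k, _ in kv_cache])                     # (t, d_head)
    V = np.stack([v for _, v in kv_cache])                     # (t, d_head)
    scores = q @ K.T / np.sqrt(d_head)                         # (n_heads, t)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return weights @ V                                         # (n_heads, d_head)

rng = np.random.default_rng(0)
d_model, r, n_heads, d_head = 64, 8, 4, 16
W_q_down = rng.normal(size=(d_model, r)) * 0.1
W_q_up = rng.normal(size=(n_heads, r, d_head)) * 0.1
W_k = rng.normal(size=(d_model, d_head)) * 0.1
W_v = rng.normal(size=(d_model, d_head)) * 0.1
cache = []
for _ in range(3):  # decode three tokens
    out = mfa_attention_step(rng.normal(size=d_model), W_q_down, W_q_up, W_k, W_v, cache)
print(len(cache), out.shape)  # → 3 (4, 16)
```

Under these assumptions each cached token costs 2 × d_head floats rather than the 2 × n_heads × d_head of standard multi-head attention, which is the kind of cache shrinkage the summary attributes to MFA.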

📝 Abstract
Large language models (LLMs) face low hardware efficiency during decoding, especially for long-context reasoning tasks. This paper introduces Step-3, a 321B-parameter VLM with hardware-aware model-system co-design optimized for minimizing decoding costs. Step-3 innovates in two key dimensions: (1) a novel Multi-Matrix Factorization Attention (MFA) mechanism that significantly reduces both KV cache size and computation while maintaining high attention expressiveness, and (2) Attention-FFN Disaggregation (AFD), a distributed inference system that decouples attention and Feed-Forward Network (FFN) layers into specialized subsystems. This co-design achieves unprecedented cost efficiency: Step-3 significantly reduces theoretical decoding costs compared with models like DeepSeek-V3 and Qwen3 MoE 235B, with the gains widening at longer context lengths. Step-3 achieves low cost while activating 38B parameters per token (more than DeepSeek-V3 and Qwen3 MoE 235B), demonstrating that hardware-aligned attention arithmetic intensity, MoE sparsity, and AFD are critical to cost-effectiveness. We perform a head-to-head comparison with DeepSeek-V3 in its favorable scenarios. Our implementation on Hopper GPUs achieves a decoding throughput of up to 4,039 tokens per second per GPU under a 50 ms TPOT SLA (4K context, FP8, no MTP). This is higher than DeepSeek-V3's 2,324 tokens per second in the same setup and sets a new Pareto frontier for LLM decoding.
Problem

Research questions and friction points this paper is trying to address.

Reducing decoding costs in large language models (LLMs)
Optimizing hardware efficiency for long-context reasoning tasks
Improving throughput and cost-effectiveness of model-system co-design
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Matrix Factorization Attention (MFA) shrinks the KV cache and attention compute
Attention-FFN Disaggregation (AFD) decouples attention and FFN into specialized subsystems
Hardware-aware model-system co-design minimizes decoding costs
👥 Authors
Bin Wang (StepFun)
Bojun Wang (StepFun)
Changyi Wan (StepFun)
Guanzhe Huang (StepFun)
Hanpeng Hu (The University of Hong Kong)
Haonan Jia (StepFun)
Hao Nie (StepFun)
Mingliang Li (Tsinghua University)
Nuo Chen (StepFun)
Siyu Chen (StepFun)
Song Yuan (Zhejiang University, CAGE)
Wuxun Xie (StepFun)
Xiaoniu Song (StepFun)
Xing Chen (StepFun)
Xingping Yang (StepFun)
Xuelin Zhang (StepFun)
Yanbo Yu (StepFun)
Yaoyu Wang (StepFun)
Yibo Zhu (StepFun)
Yimin Jiang (unknown affiliation)
Yu Zhou (StepFun)
Yuanwei Lu (StepFun)
Houyi Li (StepFun)