LoKA: Low-precision Kernel Applications for Recommendation Models At Scale

📅 2026-05-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large-scale recommendation models lose accuracy and train more slowly when FP8 low-precision computation is applied to them directly, because they are numerically sensitive, dominated by small matrix multiplications, and trained with high communication overhead. This work proposes LoKA, a framework that makes FP8 training efficient and practical by co-designing system and model components. LoKA Probe benchmarks the model online under real data distributions, learning activation and weight statistics and quantifying per-layer errors. LoKA Mods provides reusable model modifications that improve numerical stability. LoKA Dispatch is a runtime scheduler that selects the fastest FP8 kernel that still meets the accuracy constraints. The reported experiments show that LoKA accelerates FP8 training while preserving model accuracy, making low-precision training practical for large-scale recommendation systems.
📝 Abstract
Recent GPU generations deliver significantly higher FLOPs using lower-precision arithmetic, such as FP8. While FP8 has been applied successfully to large language models (LLMs), its adoption in large recommendation models (LRMs) has been limited. This is because LRMs are numerically sensitive, dominated by small matrix multiplications (GEMMs) followed by normalization, and trained in communication-intensive environments. Applying FP8 directly to LRMs often degrades model quality and prolongs training time. These challenges are inherent to LRM workloads and cannot be resolved merely by introducing better FP8 kernels. Instead, a system-model co-design approach is needed to integrate FP8 successfully. We present LoKA (Low-precision Kernel Applications), a framework that makes FP8 practical for LRMs through three principles: profile under realistic distributions to know where low precision is safe, co-design model components with hardware to expand where it is safe, and orchestrate across kernel libraries to maximize the gains. Concretely, LoKA Probe is a statistically grounded, online benchmarking method that learns activation and weight statistics and quantifies per-layer errors. This process pinpoints which sites are safe or unsafe, and fast or slow, for FP8 adoption. LoKA Mods is a set of reusable model adaptations that improve both numerical stability and execution efficiency with FP8. LoKA Dispatch is a runtime that leverages the statistical insights from LoKA Probe to select the fastest FP8 kernel that satisfies the accuracy requirements.
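The probe-then-dispatch idea in the abstract can be illustrated with a minimal sketch. This is not the LoKA implementation: the function names, the `Candidate` record, the kernel names, and the latency and error figures are all hypothetical, and FP8 is only simulated in pure Python (E4M3 with per-tensor scaling, ignoring subnormals). The sketch measures the relative error that FP8 rounding introduces into one small GEMM, then picks the fastest candidate kernel whose error stays within a tolerance, falling back to a higher-precision path otherwise.

```python
import math
import random
from dataclasses import dataclass

FP8_E4M3_MAX = 448.0  # largest finite magnitude in the E4M3 format


def fake_quant_e4m3(x: float, scale: float) -> float:
    """Round-trip x through a simplified E4M3 grid (assumption: no subnormals)."""
    if x == 0.0:
        return 0.0
    y = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x * scale))
    m, e = math.frexp(y)          # y == m * 2**e with 0.5 <= |m| < 1
    m = round(m * 16.0) / 16.0    # keep 4 significant bits (1 implicit + 3 stored)
    return math.ldexp(m, e) / scale


def quantize_tensor(t):
    """Per-tensor scaling: map the largest magnitude onto the FP8 maximum."""
    amax = max(abs(v) for row in t for v in row)
    scale = FP8_E4M3_MAX / amax if amax > 0.0 else 1.0
    return [[fake_quant_e4m3(v, scale) for v in row] for row in t]


def matmul(a, b):
    """Plain-Python reference GEMM, kept tiny on purpose."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relative_error(ref, approx):
    """Frobenius-norm relative error of approx against ref (ref must be nonzero)."""
    num = sum((r - q) ** 2 for rr, qr in zip(ref, approx) for r, q in zip(rr, qr))
    den = sum(r * r for rr in ref for r in rr)
    return math.sqrt(num / den)


def probe_gemm_error(a, w):
    """Per-layer probe: error of the simulated FP8 GEMM versus full precision."""
    return relative_error(matmul(a, w), matmul(quantize_tensor(a), quantize_tensor(w)))


@dataclass
class Candidate:
    name: str          # hypothetical kernel identifier
    latency_us: float  # hypothetical measured latency
    rel_error: float   # relative error measured by the probe


def pick_kernel(candidates, max_rel_error, fallback="bf16"):
    """Return the fastest candidate within tolerance, else the fallback."""
    ok = [c for c in candidates if c.rel_error <= max_rel_error]
    return min(ok, key=lambda c: c.latency_us).name if ok else fallback


random.seed(0)
acts = [[random.gauss(0.0, 1.0) for _ in range(16)] for _ in range(8)]
weights = [[random.gauss(0.0, 1.0) for _ in range(8)] for _ in range(16)]
print("simulated FP8 GEMM relative error:", probe_gemm_error(acts, weights))

# Placeholder numbers for illustration only; they are not measurements.
candidates = [
    Candidate("fp8_tensorwise", latency_us=10.0, rel_error=0.05),
    Candidate("fp8_rowwise", latency_us=14.0, rel_error=0.01),
]
print("chosen kernel:", pick_kernel(candidates, max_rel_error=0.02))
```

In this sketch the tolerance is a single scalar per layer. A real system would derive it from the observed activation and weight statistics and from the effect on model quality, which the abstract describes only at a high level.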
Problem

Research questions and friction points this paper is trying to address.

low-precision arithmetic
recommendation models
FP8
numerical sensitivity
model quality degradation
Innovation

Methods, ideas, or system contributions that make the work stand out.

FP8
low-precision computing
recommendation models
system-model co-design
kernel dispatch
Liang Luo
University of Washington
Systems for Machine Learning, Computer Systems, Computer Architecture, Machine Learning for Systems
Yinbin Ma
Meta AI
Quanyu Zhu
Meta AI
Vasiliy Kuznetsov
Meta AI
Yuxin Chen
Meta
Jian Jiao
Meta AI
Jiecao Yu
Research Scientist, Facebook
Computer Architecture, Machine Learning
Buyun Zhang
Meta AI
Tongyi Tang
Meta AI
Xiaohan Wei
Meta AI
Yanli Zhao
Meta AI
Zeliang Chen
Meta AI
Yuchen Hao
Meta
Computer Architecture, Machine Learning Systems, Recommendation Systems
Venkatesh Ranganathan
Meta AI
Sandeep Parab
Meta AI
Yantao Yao
Meta AI
Maxim Naumov
Meta (Director of Engineering & Research)
Parallel Algorithms, Numerical Linear Algebra, Numerical Optimization, Graphs, Deep Learning
Chunzhi Yang
Meta AI
Shen Li
Meta AI
Ellie Wen
Meta AI
Wenlin Chen
Meta Platforms
Machine Learning, Data Mining, Artificial Intelligence
Santanu Kolay
Meta AI
Chunqiang Tang
Meta AI