FairyFuse: Multiplication-Free LLM Inference on CPUs via Fused Ternary Kernels

📅 2026-04-21

🤖 AI Summary

This work presents FairyFuse, a multiplication-free inference system for large language models on general-purpose CPUs, targeting the memory bandwidth bottleneck that limits autoregressive generation on CPU-only platforms. With ternary weights in {-1, 0, +1}, floating-point multiplications are replaced by conditional additions, subtractions, or no-ops. The system fuses the eight real-valued sub-GEMVs of each widely-linear layer into a single AVX-512 loop built on masked additions and subtractions. On a single Intel Xeon 8558P processor it reaches 32.4 tokens per second, 1.24× faster than llama.cpp's Q4_K_M configuration, with near-lossless quality: a WikiText-2 perplexity of 5.52 (versus 5.47 for FP16) and 66.0% accuracy on downstream tasks.
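To make the "conditional additions and subtractions" idea concrete, here is a minimal scalar sketch of a ternary matrix-vector product in plain Python. It is illustrative only: the function name `ternary_gemv` and the list-of-lists weight layout are assumptions made for this example, and the system described above uses packed weights and AVX-512 masked instructions rather than a per-element branch.

```python
def ternary_gemv(weights, x):
    """Compute y = W x for W with entries in {-1, 0, +1} using only adds/subs.

    `weights` is a list of rows (hypothetical layout chosen for clarity).
    """
    y = []
    for row in weights:
        acc = 0.0
        for w, xj in zip(row, x):
            if w == 1:
                acc += xj      # +1: add the activation
            elif w == -1:
                acc -= xj      # -1: subtract the activation
            # 0: no-op, the activation is skipped entirely
        y.append(acc)
    return y


W = [[1, 0, -1],
     [-1, -1, 1]]
x = [0.5, 2.0, -1.5]
print(ternary_gemv(W, x))  # [2.0, -4.0]
```

No multiplication appears in the inner loop; each weight only selects whether the activation is added, subtracted, or ignored. A vectorized kernel could express the same selection with two bitmasks per weight block (one for +1, one for -1).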

📝 Abstract

Large language models are increasingly deployed on CPU-only platforms where memory bandwidth is the primary bottleneck for autoregressive generation. Weight quantization to four bits or below reduces memory pressure, yet existing systems still dequantize weights and perform floating-point multiplications, limiting the achievable gains. Ternary weights in {-1, 0, +1} provide a more efficient alternative, replacing multiplications with conditional additions, subtractions, or no-ops. While Fairy2i shows that ternary LLMs can match FP16 quality, its runtime does not exploit this structure. We present FairyFuse, an inference system that enables multiplication-free execution on commodity CPUs by fusing the eight real-valued sub-GEMVs of each widely-linear layer into a single AVX-512 loop using masked additions and subtractions, with zero floating-point multiplications. Roofline analysis shows that 16x weight compression shifts memory-bound GEMV toward the compute regime on bandwidth-limited CPUs, yielding a 29.6x kernel speedup while offering little benefit on GPUs. End-to-end, FairyFuse achieves 32.4 tokens per second on a single Intel Xeon 8558P, outperforming llama.cpp Q4_K_M by 1.24x with near-lossless quality (WikiText-2 perplexity 5.52 vs. 5.47 FP16; downstream accuracy 66.0%).
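The count of eight real-valued sub-GEMVs can be illustrated with the textbook widely-linear form y = A x + B conj(x), where A = Ar + i·Ai and B = Br + i·Bi are complex matrices: each of the two complex products expands into four real products. The sketch below fuses all eight into one pass over the input. It assumes this textbook form and ternary real and imaginary parts; the abstract does not spell out the exact decomposition, so the names `fused_widely_linear` and `add_signed` and the data layout are hypothetical, and the scalar loop only stands in for the AVX-512 masked loop.

```python
def add_signed(acc, w, v):
    """Return acc + w*v for ternary w in {-1, 0, +1}, without multiplying."""
    if w == 1:
        return acc + v
    if w == -1:
        return acc - v
    return acc


def fused_widely_linear(Ar, Ai, Br, Bi, xr, xi):
    """y = A x + B conj(x), computed in a single loop over the input.

    Real part: Ar*xr - Ai*xi + Br*xr + Bi*xi
    Imag part: Ai*xr + Ar*xi + Bi*xr - Br*xi
    """
    yr, yi = [], []
    for ar_row, ai_row, br_row, bi_row in zip(Ar, Ai, Br, Bi):
        re = im = 0.0
        for ar, ai, br, bi, r, i in zip(ar_row, ai_row, br_row, bi_row, xr, xi):
            # Four contributions to the real part (sign flips are negations).
            re = add_signed(re, ar, r)
            re = add_signed(re, -ai, i)
            re = add_signed(re, br, r)
            re = add_signed(re, bi, i)
            # Four contributions to the imaginary part.
            im = add_signed(im, ai, r)
            im = add_signed(im, ar, i)
            im = add_signed(im, bi, r)
            im = add_signed(im, -br, i)
        yr.append(re)
        yi.append(im)
    return yr, yi


Ar = [[1, 0], [-1, 1]]
Ai = [[0, -1], [1, 1]]
Br = [[-1, 1], [0, 0]]
Bi = [[1, 1], [-1, 0]]
xr = [0.5, -2.0]
xi = [1.5, 0.25]
yr, yi = fused_widely_linear(Ar, Ai, Br, Bi, xr, xi)
print(yr, yi)
```

Fusing matters because each activation pair (xr[j], xi[j]) is loaded once and reused for all eight contributions, instead of being re-read by eight separate GEMV passes.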

Problem

Research questions and friction points this paper is trying to address.

LLM inference

CPU

memory bandwidth bottleneck

ternary weights

multiplication-free

Innovation

Methods, ideas, or system contributions that make the work stand out.

multiplication-free

ternary quantization

fused kernels