Litespark Inference on Consumer CPUs: Custom SIMD Kernels for Ternary Neural Networks

📅 2026-05-07

🤖 AI Summary
This work addresses inefficient large language model inference on consumer-grade CPUs, where existing frameworks fail to exploit the computational advantages of ternary neural networks (weights in {−1, 0, +1}). It presents CPU inference kernels designed specifically for ternary models: custom SIMD implementations reformulate matrix multiplication as integer addition and subtraction and target the integer dot-product instructions of modern CPUs. Rather than treating ternary models as dense floating-point networks, the approach achieves 9.2× faster time-to-first-token, 52× higher throughput, and a 14× smaller memory footprint than standard PyTorch inference on Apple Silicon, with similar speedups on Intel and AMD processors. The implementation, Litespark-Inference, is pip-installable and integrates directly with Hugging Face.
📝 Abstract
Large language models (LLMs) have transformed artificial intelligence, but their computational requirements remain prohibitive for most users. Standard inference demands expensive datacenter GPUs or cloud API access, leaving over one billion personal computers underutilized for AI workloads. Ternary models offer a path forward: their weights are constrained to {−1, 0, +1}, theoretically eliminating the need for floating-point multiplication. However, existing frameworks fail to exploit this structure, treating ternary models as dense floating-point networks. We address this gap with custom SIMD kernels that replace matrix multiplication with simple addition and subtraction operations, targeting the integer dot-product instructions available on modern CPUs. Our implementation, Litespark-Inference, is pip-installable and integrates directly with Hugging Face, achieving 9.2× faster time-to-first-token, 52× higher throughput, and 14× memory reduction compared to standard PyTorch inference on Apple Silicon, with similar speedups on Intel and AMD processors.
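To make the core idea concrete, the following is a minimal scalar sketch of how a matrix-vector product with ternary weights reduces to additions and subtractions. It is not the Litespark-Inference kernel: the function names `ternary_matvec` and `dense_matvec`, the example matrix, and the integer activations are all assumptions made for illustration, and a real kernel would process many elements per instruction with SIMD rather than loop in Python.

```python
def ternary_matvec(weights, x):
    """Compute y = W @ x for W with entries in {-1, 0, +1} without multiplying.

    Hypothetical helper for illustration only.
    """
    out = []
    for row in weights:
        acc = 0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi      # +1 weight: add the activation
            elif w == -1:
                acc -= xi      # -1 weight: subtract the activation
            # 0 weight: contributes nothing, so it is skipped
        out.append(acc)
    return out


def dense_matvec(weights, x):
    """Reference implementation using ordinary multiply-accumulate."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


# Example ternary weight matrix (3 x 4) and integer activations (assumed values).
W = [
    [1, 0, -1, 1],
    [0, -1, -1, 0],
    [1, 1, 1, -1],
]
x = [3, -2, 5, 7]

y = ternary_matvec(W, x)
print(y)  # [5, -3, -1]
```

Both functions return the same result; the ternary version simply never issues a multiplication, which is the property a SIMD kernel can exploit at scale.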
Problem

Research questions and friction points this paper is trying to address.

ternary neural networks
LLM inference
consumer CPUs
SIMD
efficient inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Ternary Neural Networks
SIMD Kernels
CPU Inference
Integer Dot Product
Model Quantization