AI Summary
To address the challenge of efficiently deploying large language model (LLM) inference on resource-constrained edge CPU platforms, this paper proposes a full-stack co-designed ternary LLM inference framework. The method innovatively leverages SIMD register files to dynamically construct on-chip lookup tables, enabling register-level in-situ computation and circumventing memory-access bottlenecks. It integrates ternary quantization, register reorganization, ALU restructuring, and data-level parallelism to significantly accelerate GEMM and GEMV, the core operators in LLM inference. The framework achieves 5.6–24.5× lower GEMM latency and 1.1–86.2× higher GEMV throughput, and delivers up to 2.5–4.9× the energy efficiency of an NVIDIA Jetson AGX Orin. Critically, it incurs only 3.2% additional power consumption and 1.4% area overhead in the SIMD units. This work establishes a scalable, low-power LLM deployment paradigm tailored for pure-CPU edge devices.
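The core idea of the LUT approach can be illustrated in plain Python: for ternary weights in {-1, 0, +1}, a group of activations admits only 3^g possible partial dot products, so each weight group reduces to a single table lookup instead of multiplies and adds. This is a conceptual sketch only; the paper's actual design builds these tables inside SIMD registers with hardware support, and the function name and grouping here are illustrative assumptions, not the paper's API.

```python
import itertools
import numpy as np

def ternary_gemv_lut(W, x, group=4):
    """LUT-based GEMV for ternary weights (conceptual sketch).

    For each group of `group` activations, precompute the partial dot
    product for every possible ternary weight pattern (3**group entries).
    Each row's weight group then becomes a base-3 index into that table,
    replacing multiply-accumulate with a lookup.
    """
    rows, cols = W.shape
    assert cols % group == 0
    # All 3**group ternary patterns, first element most significant.
    patterns = np.array(list(itertools.product((-1, 0, 1), repeat=group)))
    place = 3 ** np.arange(group - 1, -1, -1)  # base-3 place values
    y = np.zeros(rows)
    for g0 in range(0, cols, group):
        # One partial sum per possible weight pattern for this activation group.
        lut = patterns @ x[g0:g0 + group]
        # Encode each row's weight group as an index: map {-1,0,1} -> {0,1,2}.
        idx = (W[:, g0:g0 + group] + 1) @ place
        y += lut[idx]
    return y

rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(8, 16))   # ternary weight matrix
x = rng.standard_normal(16)             # activation vector
assert np.allclose(ternary_gemv_lut(W, x), W @ x)
```

In a CPU implementation the table lives in a vector register and the gather is a byte-shuffle instruction, which is why keeping the LUT in-register rather than in memory removes the load bottleneck.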
Abstract
Recent advances in LLMs have outpaced the computational and memory capacities of edge platforms that primarily employ CPUs, thereby challenging efficient and scalable deployment. While ternary quantization enables significant resource savings, existing CPU solutions rely heavily on memory-based lookup tables (LUTs), which limit scalability, and FPGA or GPU accelerators remain impractical for edge use. This paper presents T-SAR, the first framework to achieve scalable ternary LLM inference on CPUs by repurposing the SIMD register file for dynamic, in-register LUT generation with minimal hardware modifications. T-SAR eliminates memory bottlenecks and maximizes data-level parallelism, delivering 5.6-24.5x lower GEMM latency and 1.1-86.2x higher GEMV throughput, with only 3.2% power and 1.4% area overheads in the SIMD units. T-SAR achieves 2.5-4.9x the energy efficiency of an NVIDIA Jetson AGX Orin, establishing a practical approach for efficient LLM inference on edge platforms.