🤖 AI Summary
Existing lookup-table (LUT)-based parallel inference for ultra-low-bit large language models (LLMs) on edge-device CPUs suffers from low memory bandwidth utilization, primarily due to redundant and non-contiguous memory accesses incurred by scalar LUTs operating independently per token.
Method: This paper proposes a vectorized lookup paradigm comprising three key innovations: (1) a unified vector LUT spanning multiple tokens, enabling single-instruction, multiple-output table lookups; (2) a centralized tensor layout optimized for vectorized access patterns; and (3) a cache-aware streaming lookup mechanism to minimize cache misses and memory stalls.
Results: Evaluated across five edge-device CPU platforms and three LLMs, our approach achieves up to 4.2× higher inference throughput than state-of-the-art methods. The implementation has been integrated into llama.cpp and open-sourced.
📝 Abstract
Large language models (LLMs) are increasingly deployed on edge devices. To meet strict resource constraints, real-world deployment has pushed LLM quantization from 8-bit to 4-bit, 2-bit, and now 1.58-bit. Combined with lookup table (LUT)-based inference, CPUs run these ultra-low-bit LLMs even faster than NPUs, opening new opportunities for ubiquitous on-device intelligence.
However, this paper identifies that LUT-based inference underutilizes memory bandwidth during parallel inference, which is required for prefilling, test-time scaling, and other multi-token scenarios. The root cause is the scalar LUT paradigm, which performs repetitive and non-contiguous memory accesses for each token.
To solve this issue, we propose vector LUT, a new lookup paradigm that constructs a unified LUT across parallel tokens and performs a single $1 \rightarrow N$ lookup per index. To realize it efficiently, we further introduce (1) a Vector LUT-Centric Tensor Layout and (2) a Cache-Aware Streamed Lookup technique. Evaluations on 5 edge devices across 3 LLMs show that Vec-LUT outperforms state-of-the-art baselines by up to $4.2\times$. Our implementation is integrated into llama.cpp. The code is available at https://github.com/Cipherxzc/vlut.cpp.