🤖 AI Summary
To address the performance bottleneck of non-GEMM operations—particularly dequantization—in end-to-end low-bit LLM inference on NPUs, which causes slower execution than CPUs, this paper proposes a unified lookup-table (LUT)-based co-acceleration framework. Our method introduces: (1) the first LUT-based dequantization mechanism supporting both prefill and decoding phases uniformly; (2) a hierarchical tiling strategy featuring two-level fused LUTs and concurrency-guided partitioning; and (3) hardware-efficient low-bit execution via vector-unit mapping and a three-stage pipeline. Experiments demonstrate 1.4× and 3.1× speedups in prefill and decoding, respectively, over baseline NPU implementations, alongside an 84% reduction in energy consumption. These results significantly alleviate the accuracy–efficiency trade-off that has historically constrained low-bit LLM inference on NPUs.
📝 Abstract
Large language models (LLMs) are increasingly deployed on consumer devices. To support them, current devices adopt SoCs (systems-on-chip) with integrated NPUs (neural processing units). Although high performance is expected, LLM inference on NPUs is slower than its CPU counterpart. The reason is that NPUs perform poorly on computations other than GEMM, such as dequantization. Existing works either disaggregate inference, running prefill on the NPU and decoding on the CPU, or run both on the NPU at the cost of accuracy. To solve this issue, based on the insight that low-bit quantization shrinks the input space enough for the target computation to be encoded in an acceptably sized table, we propose table lookup to subsume operations the hardware does not otherwise support efficiently. To realize this, we overcome the conflicting hardware behavior of prefill and decoding and design a unified table layout and tiling through (1) fused two-level table-based dequantization and (2) concurrency-hierarchy-guided tiling. Based on that, we implement the prefill phase with a three-stage pipeline and map the table-lookup-based decoding to the NPU's vector units. Results show 1.4x and 3.1x speedups for prefill and decoding respectively, and 84% energy savings compared to the baseline NPU methods. The code is available at https://github.com/microsoft/T-MAC/tree/main/t-man.
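The core insight, that low-bit quantization makes the input space small enough to precompute, can be illustrated with a minimal sketch. The snippet below is not the paper's implementation; it assumes symmetric 4-bit quantization with hypothetical helper names, and shows how dequantization arithmetic collapses into a 16-entry table lookup, the kind of operation that maps naturally to an NPU's vector gather/shuffle units:

```python
import numpy as np

# Illustrative sketch only: 4-bit codes admit just 2**4 = 16 values,
# so the entire dequantization function fits in a tiny table.
BITS = 4
LEVELS = 1 << BITS  # 16 entries

def build_dequant_lut(scale: float, zero_point: int) -> np.ndarray:
    # Precompute the dequantized value for every possible 4-bit code.
    codes = np.arange(LEVELS, dtype=np.int32)
    return (codes - zero_point) * scale  # shape (16,)

def dequantize_via_lut(codes: np.ndarray, lut: np.ndarray) -> np.ndarray:
    # Replace per-element subtract/multiply with a pure table lookup.
    return lut[codes]

# Usage: recover weights through the table instead of arithmetic.
scale, zero_point = 0.05, 8           # hypothetical quantization params
lut = build_dequant_lut(scale, zero_point)
codes = np.array([0, 8, 15], dtype=np.int32)
print(dequantize_via_lut(codes, lut))  # matches (codes - 8) * 0.05
```

In the paper's setting the same principle extends further: because the codes are few, tables can encode not just dequantization but fused downstream computation, which is what makes a unified layout for both prefill and decoding feasible.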