Llamas on the Web: Memory-Efficient, Performance-Portable, and Multi-Precision LLM Inference with WebGPU

📅 2026-05-20

🤖 AI Summary

Deploying large language models in web browsers is constrained by limited memory and heterogeneous hardware, which makes it hard to achieve efficiency, privacy, and portability at once. This work introduces LlamaWeb, a WebGPU backend for llama.cpp that combines static memory planning, efficient model loading, a tunable kernel library that handles cross-device variability, and templated GPU kernels that support numerous quantization formats. Evaluated on 16 devices from 8 vendors, LlamaWeb uses 29–33% less memory than existing browser-based LLM frameworks across several combinations of device, browser, and operating system, and increases decode throughput by 45–69% across four GPUs from separate vendors. On some devices it is competitive with, or faster than, other llama.cpp backends, including vendor-specific ones.

📝 Abstract

Running language models in the browser presents a unique opportunity to build efficient, private, and portable AI applications, but requires contending with constrained memory availability and heterogeneous hardware targets. To realize this opportunity, we present Llamas on the Web (LlamaWeb), a WebGPU backend for llama.cpp that enables memory-efficient and performance-portable LLM inference across a wide range of model weight formats in the browser. Our design significantly reduces memory overhead through static memory planning and efficient model loading, addresses cross-device variability through a tunable kernel library, and introduces templated GPU kernels that support performant implementations of numerous quantization formats, enabling broad model support and extensibility to new formats. We evaluate LlamaWeb on 16 devices from 8 vendors, collecting data from 10 language models and four model weight formats. We compare LlamaWeb against existing browser-based LLM frameworks and find that LlamaWeb requires 29-33% less memory across several combinations of device, browser, and operating system. We also evaluate LlamaWeb's performance against these frameworks and find that it increases decode throughput by 45-69% across four GPUs from separate vendors. In addition, we compare LlamaWeb's performance against other llama.cpp backends, where it is competitive with and even beats vendor-specific backend performance on some devices.

Problem

Research questions and friction points this paper is trying to address.

LLM inference

WebGPU

memory efficiency

performance portability

multi-precision

Innovation

Methods, ideas, or system contributions that make the work stand out.

WebGPU

memory-efficient inference

performance portability