Intelligence per Watt: Measuring Intelligence Efficiency of Local AI

📅 2025-11-11
📈 Citations: 0 · Influential: 0
🤖 AI Summary
To address the growing inference load that surging LLM query volumes place on centralized cloud infrastructure, this paper investigates the feasibility of offloading practical query tasks to lightweight local models on power-constrained devices (e.g., laptops). The authors propose a novel evaluation metric, Intelligence per Watt (IPW), to systematically quantify the trade-offs among energy efficiency, accuracy, and latency in local AI inference. Their end-to-end evaluation covers more than 20 state-of-the-art open-weight LLMs and eight accelerator platforms, benchmarked on over one million real-world single-turn chat and reasoning queries. Results show that local models correctly answer 88.7% of real user queries; IPW improved 5.3× from 2023 to 2025, while local inference coverage rose from 23.2% to 71.3%. These findings demonstrate substantial headroom for optimization and the practical viability of on-device AI deployment.

📝 Abstract
Large language model (LLM) queries are predominantly processed by frontier models in centralized cloud infrastructure. Rapidly growing demand strains this paradigm, and cloud providers struggle to scale infrastructure at pace. Two advances enable us to rethink this paradigm: small LMs (≤20B active parameters) now achieve competitive performance to frontier models on many tasks, and local accelerators (e.g., Apple M4 Max) run these models at interactive latencies. This raises the question: can local inference viably redistribute demand from centralized infrastructure? Answering this requires measuring whether local LMs can accurately answer real-world queries and whether they can do so efficiently enough to be practical on power-constrained devices (e.g., laptops). We propose intelligence per watt (IPW), task accuracy per unit of power, as a metric for assessing the capability and efficiency of local inference across model-accelerator pairs. We conduct a large-scale empirical study across 20+ state-of-the-art local LMs, 8 accelerators, and a representative subset of LLM traffic: 1M real-world single-turn chat and reasoning queries. For each query, we measure accuracy, energy, latency, and power. Our analysis reveals three findings. First, local LMs can accurately answer 88.7% of single-turn chat and reasoning queries, with accuracy varying by domain. Second, from 2023 to 2025, IPW improved 5.3x and local query coverage rose from 23.2% to 71.3%. Third, local accelerators achieve at least 1.4x lower IPW than cloud accelerators running identical models, revealing significant headroom for optimization. These findings demonstrate that local inference can meaningfully redistribute demand from centralized infrastructure, with IPW serving as the critical metric for tracking this transition. We release our IPW profiling harness for systematic intelligence-per-watt benchmarking.
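As a compact statement of the metric (a sketch assuming the abstract's definition, with accuracy and power averaged over the same run of N queries; the paper's exact normalization may differ):

\[
\mathrm{IPW} \;=\; \frac{\text{task accuracy}}{\text{average power draw}}
\;=\; \frac{\frac{1}{N}\sum_{i=1}^{N}\mathbf{1}\!\left[\hat{y}_i = y_i\right]}{\frac{1}{N}\sum_{i=1}^{N}\bar{P}_i}
\]

where \(\hat{y}_i\) is the model's answer to query \(i\), \(y_i\) the reference answer, and \(\bar{P}_i\) the mean power draw (in watts) while serving that query.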
Problem

Research questions and friction points this paper is trying to address.

Measuring local AI's ability to redistribute demand from centralized cloud infrastructure
Assessing local language models' accuracy on real-world queries under power constraints
Evaluating intelligence efficiency through task accuracy per unit of power consumption
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes intelligence per watt metric for local AI efficiency
Evaluates small LMs on local accelerators for chat queries
Shows local inference can redistribute cloud demand effectively
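The abstract notes a released IPW profiling harness. As an illustration only, the following is a minimal sketch of such a measurement loop, not the authors' harness; run_model, read_power_watts, and is_correct are hypothetical hooks standing in for the model runtime, an accelerator power sensor, and the accuracy judge:

import time

def profile_ipw(queries, run_model, read_power_watts, is_correct):
    # Accumulate correctness, wall-clock time, and energy across queries.
    correct, total_s, energy_j = 0, 0.0, 0.0
    for prompt, reference in queries:
        t0 = time.monotonic()
        p0 = read_power_watts()           # power draw before the query (W)
        answer = run_model(prompt)
        p1 = read_power_watts()           # power draw after the query (W)
        dt = time.monotonic() - t0
        total_s += dt
        energy_j += 0.5 * (p0 + p1) * dt  # trapezoidal energy estimate (J)
        correct += int(is_correct(answer, reference))
    accuracy = correct / len(queries)
    avg_power_w = energy_j / total_s      # average power over the run (W)
    return accuracy / avg_power_w         # intelligence per watt

# Example (hypothetical hooks): profile_ipw([("2+2?", "4")], my_model, sensor.watts, exact_match)

Sampling power only at the two endpoints of each query is a coarse assumption; a real harness would poll the sensor at a fixed rate during generation and integrate over the samples.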
Authors

Jon Saad-Falcon
PhD Student at Stanford University
Natural Language Processing, Machine Learning, Information Retrieval, Systems for ML

A. Narayan
Department of Computer Science, Stanford University, Stanford, CA, USA

Hakki O. Akengin
Department of Computer Science, Stanford University, Stanford, CA, USA

J. W. Griffin
Department of Computer Science, Stanford University, Stanford, CA, USA

Herumb Shandilya
Department of Computer Science, Stanford University, Stanford, CA, USA

Adrian Gamarra Lafuente
Department of Computer Science, Stanford University, Stanford, CA, USA

Medhya Goel
Department of Computer Science, Stanford University, Stanford, CA, USA

Rebecca Joseph
Department of Computer Science, Stanford University, Stanford, CA, USA

Shlok Natarajan
Department of Computer Science, Stanford University, Stanford, CA, USA

E. Guha
Department of Computer Science, Stanford University, Stanford, CA, USA

Shang Zhu
Together AI, San Francisco, CA, USA

Ben Athiwaratkun
Together AI
Artificial Intelligence

John Hennessy
Department of Computer Science, Stanford University, Stanford, CA, USA

Azalia Mirhoseini
Assistant Professor of Computer Science, Stanford - Google DeepMind
AI, Systems, Scaling

Christopher Ré
Department of Computer Science, Stanford University, Stanford, CA, USA