Intelligence per Watt: Measuring Intelligence Efficiency of Local AI

📅 2025-11-11
📈 Citations: 0 · Influential: 0
🤖 AI Summary
To address the growing inference load that surging LLM query volumes place on centralized cloud infrastructure, this paper investigates the feasibility of offloading practical query tasks to lightweight local models on power-constrained devices (e.g., laptops). The authors propose a novel evaluation metric, Intelligence per Watt (IPW), to systematically quantify the trade-offs among energy efficiency, accuracy, and latency in local AI inference. Their end-to-end evaluation covers more than 20 state-of-the-art open-weight LLMs and eight accelerator platforms, benchmarked on over one million real-world single-turn chat and reasoning queries. Results show that local models correctly answer 88.7% of real user queries; IPW improved 5.3× from 2023 to 2025, while local inference coverage rose from 23.2% to 71.3%. These findings demonstrate substantial headroom for optimization and the practical viability of on-device AI deployment.

📝 Abstract
Large language model (LLM) queries are predominantly processed by frontier models in centralized cloud infrastructure. Rapidly growing demand strains this paradigm, and cloud providers struggle to scale infrastructure at pace. Two advances enable us to rethink this paradigm: small LMs (≤20B active parameters) now achieve competitive performance to frontier models on many tasks, and local accelerators (e.g., Apple M4 Max) run these models at interactive latencies. This raises the question: can local inference viably redistribute demand from centralized infrastructure? Answering this requires measuring whether local LMs can accurately answer real-world queries and whether they can do so efficiently enough to be practical on power-constrained devices (e.g., laptops). We propose intelligence per watt (IPW), task accuracy per unit of power, as a metric for assessing the capability and efficiency of local inference across model-accelerator pairs. We conduct a large-scale empirical study across 20+ state-of-the-art local LMs, 8 accelerators, and a representative subset of LLM traffic: 1M real-world single-turn chat and reasoning queries. For each query, we measure accuracy, energy, latency, and power. Our analysis reveals three findings. First, local LMs can accurately answer 88.7% of single-turn chat and reasoning queries, with accuracy varying by domain. Second, from 2023 to 2025, IPW improved 5.3x and local query coverage rose from 23.2% to 71.3%. Third, local accelerators achieve at least 1.4x lower IPW than cloud accelerators running identical models, revealing significant headroom for optimization. These findings demonstrate that local inference can meaningfully redistribute demand from centralized infrastructure, with IPW serving as the critical metric for tracking this transition. We release our IPW profiling harness for systematic intelligence-per-watt benchmarking.
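As a compact statement of the metric (a sketch assuming the abstract's definition, with accuracy and power averaged over the same run of N queries; the paper's exact normalization may differ):

\[
\mathrm{IPW} \;=\; \frac{\text{task accuracy}}{\text{average power draw}}
\;=\; \frac{\frac{1}{N}\sum_{i=1}^{N}\mathbf{1}\!\left[\hat{y}_i = y_i\right]}{\frac{1}{N}\sum_{i=1}^{N}\bar{P}_i}
\]

where \(\hat{y}_i\) is the model's answer to query \(i\), \(y_i\) the reference answer, and \(\bar{P}_i\) the mean power draw (in watts) while serving that query.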
Problem

Research questions and friction points this paper is trying to address.

Measuring local AI's ability to redistribute demand from centralized cloud infrastructure
Assessing local language models' accuracy on real-world queries under power constraints
Evaluating intelligence efficiency through task accuracy per unit of power consumption
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes intelligence per watt metric for local AI efficiency
Evaluates small LMs on local accelerators for chat queries
Shows local inference can redistribute cloud demand effectively
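The abstract notes a released IPW profiling harness. As an illustration only, the following is a minimal sketch of such a measurement loop, not the authors' harness; run_model, read_power_watts, and is_correct are hypothetical hooks standing in for the model runtime, an accelerator power sensor, and the accuracy judge:

import time

def profile_ipw(queries, run_model, read_power_watts, is_correct):
    # Accumulate correctness, wall-clock time, and energy across queries.
    correct, total_s, energy_j = 0, 0.0, 0.0
    for prompt, reference in queries:
        t0 = time.monotonic()
        p0 = read_power_watts()           # power draw before the query (W)
        answer = run_model(prompt)
        p1 = read_power_watts()           # power draw after the query (W)
        dt = time.monotonic() - t0
        total_s += dt
        energy_j += 0.5 * (p0 + p1) * dt  # trapezoidal energy estimate (J)
        correct += int(is_correct(answer, reference))
    accuracy = correct / len(queries)
    avg_power_w = energy_j / total_s      # average power over the run (W)
    return accuracy / avg_power_w         # intelligence per watt

# Example (hypothetical hooks): profile_ipw([("2+2?", "4")], my_model, sensor.watts, exact_match)

Sampling power only at the two endpoints of each query is a coarse assumption; a real harness would poll the sensor at a fixed rate during generation and integrate over the samples.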
Authors

Jon Saad-Falcon
PhD Student at Stanford University
Natural Language Processing, Machine Learning, Information Retrieval, Systems for ML

A. Narayan
Department of Computer Science, Stanford University, Stanford, CA, USA

Hakki O. Akengin
Department of Computer Science, Stanford University, Stanford, CA, USA

J. W. Griffin
Department of Computer Science, Stanford University, Stanford, CA, USA

Herumb Shandilya
Department of Computer Science, Stanford University, Stanford, CA, USA

Adrian Gamarra Lafuente
Department of Computer Science, Stanford University, Stanford, CA, USA

Medhya Goel
Department of Computer Science, Stanford University, Stanford, CA, USA

Rebecca Joseph
Department of Computer Science, Stanford University, Stanford, CA, USA

Shlok Natarajan
Department of Computer Science, Stanford University, Stanford, CA, USA

E. Guha
Department of Computer Science, Stanford University, Stanford, CA, USA

Shang Zhu
Together AI, San Francisco, CA, USA

Ben Athiwaratkun
Together AI
Artificial Intelligence

John Hennessy
Department of Computer Science, Stanford University, Stanford, CA, USA

Azalia Mirhoseini
Assistant Professor of Computer Science, Stanford - Google DeepMind
AI, Systems, Scaling

Christopher Ré
Department of Computer Science, Stanford University, Stanford, CA, USA