Quantifying Edge Intelligence: Inference-Time Scaling Formalisms for Heterogeneous Computing

📅 2026-01-23
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the challenge of deploying large language models on edge devices, which is hindered by a lack of systematic understanding of how inference latency and energy efficiency scale across heterogeneous hardware (CPU/GPU/NPU). The authors propose QEIL, a unified framework that, for the first time, uncovers stable power-law scaling behaviors of Transformer models with respect to latency, energy consumption, and task coverage. Leveraging these insights, QEIL introduces three composite metrics and a safety-aware intelligent scheduler to enable coordinated optimization across heterogeneous accelerators from diverse vendors. Through formal modeling, computational orchestration, thermal management, fault-tolerant execution, and hardware health monitoring, QEIL significantly improves energy efficiency, reduces latency, and expands task coverage across five model families, while preserving model accuracy and ensuring system safety.

๐Ÿ“ Abstract
Deploying large language models (LLMs) on resource-constrained edge devices is limited by a poor understanding of inference-time scaling on heterogeneous hardware. We present QEIL (Quantifying Edge Intelligence via Inference-Time Scaling Formalisms), a unified framework to characterize and optimize inference across CPUs, GPUs, and NPUs. QEIL reveals stable power-law scaling behavior in latency, energy, and task coverage for transformer models ranging from 125M to 2.6B parameters, and demonstrates that heterogeneous orchestration with intelligent coordination across mixed accelerators consistently improves energy efficiency and coverage compared to homogeneous execution. QEIL introduces three composite metrics: Intelligence per Watt, Energy Coverage Efficiency, and Price-Power-Performance, enabling multi-objective optimization for edge intelligence. A safety-first agentic orchestrator dynamically allocates workloads across same-vendor and cross-vendor accelerators while enforcing thermal constraints, fault-tolerant execution, adversarial input validation, and continuous hardware health monitoring. Evaluations across five model families show that QEIL achieves consistent improvements in efficiency, latency, and coverage without sacrificing accuracy or system safety, establishing inference-time scaling and heterogeneous orchestration as key foundations for reliable edge AI.
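The power-law scaling the abstract describes, e.g. latency as a function of parameter count, is typically recovered by a linear fit in log-log space. The sketch below illustrates the idea with synthetic measurements; the data points and the resulting coefficients are illustrative assumptions, not numbers from the paper.

```python
import math

# Hypothetical measurements: (parameter count in millions, latency in ms).
# A power law L(N) = a * N^b is linear in log space:
#   log L = log a + b * log N
measurements = [(125, 40.0), (350, 95.0), (760, 180.0),
                (1300, 290.0), (2600, 530.0)]

xs = [math.log(n) for n, _ in measurements]
ys = [math.log(t) for _, t in measurements]

# Ordinary least-squares slope/intercept in log-log space.
k = len(xs)
mx, my = sum(xs) / k, sum(ys) / k
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
    sum((x - mx) ** 2 for x in xs)
a = math.exp(my - b * mx)

print(f"fitted power law: L(N) ≈ {a:.3f} * N^{b:.3f}")
```

A near-linear cloud of points in log-log space (and a stable exponent `b` across hardware backends) is what would justify calling the scaling behavior a power law.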
Problem

Research questions and friction points this paper is trying to address.

Edge Intelligence
Inference-Time Scaling
Heterogeneous Computing
Large Language Models
Resource-Constrained Devices
Innovation

Methods, ideas, or system contributions that make the work stand out.

Inference-time scaling
Heterogeneous orchestration
Edge intelligence
Power law behavior
Multi-objective optimization
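The three composite metrics named in the abstract could be sketched as simple ratios. The definitions below are assumptions for illustration only; the paper's exact formulas are not reproduced in this summary.

```python
# Assumed illustrative definitions (not the paper's actual formulas):
# - Intelligence per Watt: task-coverage score per watt of average power
# - Energy Coverage Efficiency: task-coverage score per joule consumed
# - Price-Power-Performance: throughput per (dollar * watt)

def intelligence_per_watt(coverage_score: float, avg_power_w: float) -> float:
    return coverage_score / avg_power_w

def energy_coverage_efficiency(coverage_score: float, energy_j: float) -> float:
    return coverage_score / energy_j

def price_power_performance(tokens_per_s: float, price_usd: float,
                            avg_power_w: float) -> float:
    return tokens_per_s / (price_usd * avg_power_w)

# Hypothetical NPU: 5 W average draw, 80% task coverage over a 60 s run,
# 30 tokens/s throughput, $120 board price.
print(intelligence_per_watt(0.80, 5.0))           # 0.16
print(energy_coverage_efficiency(0.80, 5.0 * 60.0))
print(price_power_performance(30.0, 120.0, 5.0))  # 0.05
```

Framing each metric as a ratio makes multi-objective optimization concrete: a scheduler can score each candidate accelerator assignment on all three axes and pick a point on the resulting trade-off frontier.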