LLMForge: Multi-Backend Hardware-Aware Neural Architecture Search with Infinite-Head Attention for Edge Language Models

📅 2026-05-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of deploying large language models on edge devices, where hardware constraints such as memory bandwidth, power consumption, and thermal limitations necessitate efficient cross-platform architecture search. The authors propose LLMForge, a framework that expands the search space through Infinite-Head Attention (IHA) and integrates a transformer-based surrogate model, Forge-Former, with a multi-objective, hardware-aware design space explorer, Forge-DSE, supporting diverse backends including GPUs, systolic arrays, and ring-based dataflow accelerators. On a ring-based multi-chip platform, LLMForge automatically discovers three Pareto-optimal 300M-scale models: one achieving a validation loss of 2.798 for peak accuracy, another reducing per-token energy consumption by 40% for optimal efficiency, and a third cutting both first-token and subsequent-token latency by 43% for minimal delay.
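The notion of "Pareto-optimal" used above can be illustrated with a small non-dominated filter over three minimized objectives. This is a minimal sketch, not the authors' Forge-DSE code: the function names and the candidate values are hypothetical placeholders, and NSGA-II itself adds ranking, crowding distance, and genetic operators on top of this basic dominance test.

```python
from typing import List, Tuple

# Hypothetical objective tuple: (validation_loss, energy_per_token, latency).
# All three are treated as "lower is better" in this sketch.
Objectives = Tuple[float, float, float]


def dominates(a: Objectives, b: Objectives) -> bool:
    """Return True if a is no worse than b everywhere and strictly better somewhere."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points: List[Objectives]) -> List[Objectives]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


if __name__ == "__main__":
    # Placeholder numbers chosen for illustration only; they are not results from the paper.
    candidates = [
        (2.8, 1.0, 1.0),   # best loss
        (2.9, 0.6, 1.0),   # best energy
        (3.0, 1.0, 0.57),  # best latency
        (3.1, 1.1, 1.2),   # worse on every objective than the first candidate
    ]
    for point in pareto_front(candidates):
        print(point)
```

In this toy setting the first three candidates survive because each is best on one objective, which mirrors how a single search can return separate accuracy-, energy-, and latency-oriented variants.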

📝 Abstract

Sub-billion-parameter Transformer language models are increasingly deployed on edge devices, where the privacy, latency, and operating-cost advantages of on-device inference are constrained by tight memory-bandwidth, energy, and thermal budgets. Under these budgets, architectural choice and accelerator-specific cost become central to efficient inference. We present LLMForge, a hardware-aware neural architecture search (NAS) framework whose three composable contributions together make edge-LM architecture search hardware-conditioned, since different substrates impose different hardware cost bottlenecks. Infinite-Head Attention (IHA) decouples the number of query heads, the number of KV groups, and the per-head query/key and value dimensions, expanding the feasible per-layer attention configuration space by approximately 400x over grouped-query attention within our search-space ranges. Forge-Former, an encoder-based surrogate for ranking architectural candidates, outperforms MLP and random-forest baselines. Forge-DSE, an NSGA-II-based design-space-exploration engine, pairs Forge-Former with a multi-backend hardware cost model spanning GPUs, systolic accelerators, and ring-dataflow edge accelerators. Across four different hardware substrates, the searches converge to visibly different architectures whose shapes track each substrate's cost bottleneck. On the multi-chip ring substrate, our co-search returns three 300M-scale deployment-aware variants on the Pareto front. Each is re-trained on FineWeb-Edu-10BT under a matched recipe against SmolLM2-360M and Qwen-0.5B architecture baselines. The accurate variant has the lowest validation loss (2.798) and competitive benchmark performance with fewer parameters, the energy-optimized variant lowers energy per token by 40%, and the latency-optimized variant lowers time to first token (TTFT) and time per output token (TPOT) by 43%.
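To make the decoupling in IHA concrete, the sketch below enumerates a toy per-layer attention space twice: once with the per-head dimension tied to the model width (one common grouped-query setup) and once with the query/key and value dimensions chosen independently. Everything here is an assumption for illustration: the `AttnConfig` fields, the divisibility constraint between query heads and KV groups, the projection shapes, and the ranges are hypothetical and are not the paper's search space, so the resulting ratio is not expected to match the reported 400x.

```python
from dataclasses import dataclass
from itertools import product
from typing import List


@dataclass(frozen=True)
class AttnConfig:
    """Hypothetical per-layer attention configuration."""
    n_q_heads: int    # number of query heads
    n_kv_groups: int  # number of shared key/value groups
    qk_dim: int       # per-head query/key dimension
    v_dim: int        # per-head value dimension


def projection_params(cfg: AttnConfig, d_model: int) -> int:
    """Weight count of the Q, K, V, and output projections (biases ignored)."""
    q = d_model * cfg.n_q_heads * cfg.qk_dim
    k = d_model * cfg.n_kv_groups * cfg.qk_dim
    v = d_model * cfg.n_kv_groups * cfg.v_dim
    o = cfg.n_q_heads * cfg.v_dim * d_model
    return q + k + v + o


def kv_cache_elems_per_token(cfg: AttnConfig) -> int:
    """Elements cached per token per layer: one key and one value per KV group."""
    return cfg.n_kv_groups * (cfg.qk_dim + cfg.v_dim)


def tied_gqa_space(d_model: int, head_counts: List[int]) -> List[AttnConfig]:
    """Assumed baseline: head dim is d_model // n_q_heads, shared by Q/K and V."""
    configs = []
    for n_q, n_kv in product(head_counts, head_counts):
        if d_model % n_q == 0 and n_kv <= n_q and n_q % n_kv == 0:
            head_dim = d_model // n_q
            configs.append(AttnConfig(n_q, n_kv, head_dim, head_dim))
    return configs


def decoupled_space(head_counts: List[int], dims: List[int]) -> List[AttnConfig]:
    """Assumed decoupled space: qk_dim and v_dim are chosen independently."""
    return [
        AttnConfig(n_q, n_kv, qk, v)
        for n_q, n_kv, qk, v in product(head_counts, head_counts, dims, dims)
        if n_kv <= n_q and n_q % n_kv == 0
    ]


if __name__ == "__main__":
    heads, dims, d_model = [1, 2, 4, 8], [32, 64, 128], 512
    tied = tied_gqa_space(d_model, heads)
    free = decoupled_space(heads, dims)
    print(len(tied), len(free))
    example = AttnConfig(n_q_heads=8, n_kv_groups=2, qk_dim=64, v_dim=64)
    print(projection_params(example, d_model), kv_cache_elems_per_token(example))
```

The two helper functions show why such a space matters for hardware cost: shrinking `n_kv_groups` or `v_dim` reduces the per-token KV cache, which bears on memory bandwidth, while `n_q_heads` and `qk_dim` mostly affect compute in the projections.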

Problem

Research questions and friction points this paper is trying to address.

edge language models

hardware-aware neural architecture search

memory-bandwidth constraints

energy efficiency

multi-backend deployment

Innovation

Methods, ideas, or system contributions that make the work stand out.

Infinite-Head Attention

Hardware-Aware NAS

Multi-Backend Optimization