Technology solutions targeting the performance of gen-AI inference in resource-constrained platforms

📅 2026-04-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the severe memory capacity and bandwidth bottlenecks that generative AI inference faces on resource-constrained devices, particularly in long-context and multimodal scenarios. The authors propose a hierarchical roofline performance model to systematically evaluate, for the first time, the bandwidth and latency requirements of High Bandwidth Storage (HBS) for large-model, long-context inference, establishing clear HBS performance thresholds needed to achieve interactive throughput. For smaller models, they design an efficient memory utilization scheme leveraging a bonded global buffer memory chiplet. Experimental results demonstrate that the proposed approaches substantially alleviate memory pressure and improve energy efficiency, offering a practical technical pathway for deploying generative AI at the edge.
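The roofline framing behind the analysis can be sketched in a few lines: per-token latency is lower-bounded by the larger of compute-bound and bandwidth-bound time. This is a minimal single-level sketch, not the paper's hierarchical model, and every hardware and model number below is an illustrative assumption:

```python
# Minimal (single-level) roofline estimate for decode-phase LLM inference.
# All parameter values are assumed for illustration, not taken from the paper.

def roofline_time_per_token(flops, bytes_moved, peak_flops, peak_bw):
    """Lower-bound time per token: max of compute-bound and memory-bound time."""
    return max(flops / peak_flops, bytes_moved / peak_bw)

params = 13e9                  # assumed 13B-parameter model (as in the paper's large case)
flops = 2 * params             # ~2 FLOPs per parameter per decoded token
bytes_moved = 2 * params       # fp16 weights streamed once per token
peak_flops = 10e12             # assumed 10 TFLOP/s edge accelerator
peak_bw = 100e9                # assumed 100 GB/s memory bandwidth

t = roofline_time_per_token(flops, bytes_moved, peak_flops, peak_bw)
print(f"{1.0 / t:.1f} tokens/s upper bound")  # → 3.8 tokens/s upper bound
```

With these assumed numbers the memory term dominates by two orders of magnitude, which is why raising effective bandwidth (e.g., via HBS) rather than compute is the lever for interactive throughput.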

📝 Abstract
The rise of generative AI workloads, particularly language model inference, is intensifying on/off-chip memory pressure. Multimodal inputs such as video streams or images and downstream applications like Question Answering (QA) and analysis over large documents incur long context lengths, requiring caching of massive Key and Value states of the previous tokens. Even a low degree of concurrent inference serving on resource-constrained devices, like mobiles, can further add to memory capacity pressure and runtime memory management complexity. In this paper, we evaluate the performance implications of two emerging technology solutions to alleviate the memory pressure in terms of both capacity and bandwidth using a hierarchical roofline-based analytical performance model. For large models (e.g., 13B parameters) and context lengths, we investigate the performance implications of High Bandwidth Storage (HBS) and outline bandwidth/latency requirements to achieve an acceptable throughput for interactivity. For small models (e.g., 1B parameters), we evaluate the merit of a bonded global buffer memory chiplet and propose how to best utilize it.
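The capacity pressure from caching Key and Value states can be made concrete with a back-of-envelope estimate. The sketch below uses assumed architecture parameters for a generic ~13B-class model (they are not taken from the paper):

```python
# Back-of-envelope KV-cache footprint for long-context inference.
# Layer/head/dimension values are assumptions for a generic ~13B model.

def kv_cache_bytes(layers, kv_heads, head_dim, context_len, dtype_bytes=2):
    # Factor of 2 covers both Key and Value states, cached per layer
    # for every previous token in the context.
    return 2 * layers * kv_heads * head_dim * context_len * dtype_bytes

size = kv_cache_bytes(layers=40, kv_heads=40, head_dim=128, context_len=128_000)
print(f"{size / 2**30:.1f} GiB")  # → 97.7 GiB
```

At fp16 this single 128K-token context already approaches 100 GiB, dwarfing the weights themselves and multiplying with every concurrent request, which is the capacity pressure the paper targets.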
Problem

Research questions and friction points this paper is trying to address.

generative AI inference
memory pressure
resource-constrained platforms
long context lengths
Key-Value caching
Innovation

Methods, ideas, or system contributions that make the work stand out.

High Bandwidth Storage
Memory Chiplet
Generative AI Inference
Key-Value Cache
Roofline Model