🤖 AI Summary
Large-batch LLM inference remains predominantly bound by GPU DRAM bandwidth, not by compute as conventionally assumed. This is especially true for smaller models, where the memory bottleneck leaves most of the GPU's compute units underutilized.
Method: We propose the Batching Configuration Advisor (BCA), a system that optimizes memory allocation guided by low-level GPU performance profiling and bandwidth modeling, and reclaims the freed memory and idle compute through lightweight model replication and concurrent scheduling, all while preserving low latency.
Contribution/Results: Our GPU-level analysis refutes the prevailing assumption that large-batch inference enters the compute-bound regime, and BCA turns this insight into a joint optimization of memory allocation and compute reuse. Experiments show up to 2.3× higher throughput for small-model, large-batch inference, over 40% improvement in overall GPU utilization, and a substantially reduced memory footprint.
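The memory-bound claim can be sanity-checked with a back-of-the-envelope roofline estimate. The sketch below is not from the paper; all hardware and model numbers (an A100-class GPU at 312 TFLOP/s FP16 and 2 TB/s DRAM bandwidth, a 7B-parameter FP16 model, a ~0.5 GB-per-sequence KV cache) are illustrative assumptions.

```python
# Back-of-the-envelope roofline check for batched LLM decode.
# All constants below are illustrative assumptions (A100-like GPU,
# 7B-parameter model in FP16), not measurements from the paper.

PEAK_FLOPS = 312e12       # assumed FP16 tensor-core peak, FLOP/s
DRAM_BW    = 2.0e12       # assumed DRAM bandwidth, bytes/s
PARAMS     = 7e9          # model parameter count
BYTES_PER_PARAM = 2       # FP16 weights

# FLOPs per byte a kernel must sustain to become compute-bound
machine_balance = PEAK_FLOPS / DRAM_BW

def decode_intensity(batch_size, kv_bytes_per_seq=0.0):
    """Arithmetic intensity (FLOP/byte) of one decode step.

    Each step streams every weight from DRAM once regardless of batch
    size and performs ~2*PARAMS FLOPs per sequence, so intensity grows
    with the batch; KV-cache reads grow with the batch too, which pulls
    intensity back down and keeps decode memory-bound.
    """
    flops = 2 * PARAMS * batch_size
    bytes_moved = PARAMS * BYTES_PER_PARAM + kv_bytes_per_seq * batch_size
    return flops / bytes_moved

for b in (1, 32, 256):
    print(f"batch={b:4d}  intensity={decode_intensity(b, 0.5e9):6.1f}"
          f"  machine balance={machine_balance:.0f}")
```

Under these assumptions the intensity saturates around 28 FLOP/byte as the batch grows, far below the ~156 FLOP/byte machine balance, which is consistent with the paper's observation that large-batch decode stays DRAM-bound rather than becoming compute-bound.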
📝 Abstract
Large language models have been widely adopted across different tasks, but the auto-regressive nature of their generation often leads to inefficient resource utilization during inference. While batching is commonly used to increase throughput, performance gains plateau beyond a certain batch size, especially with smaller models, a phenomenon that existing literature typically explains as a shift to the compute-bound regime. In this paper, through an in-depth GPU-level analysis, we reveal that large-batch inference remains memory-bound, with most GPU compute capabilities underutilized due to DRAM bandwidth saturation as the primary bottleneck. To address this, we propose a Batching Configuration Advisor (BCA) that optimizes memory allocation, reducing GPU memory requirements with minimal impact on throughput. The freed memory and underutilized GPU compute capabilities can then be leveraged by concurrent workloads. Specifically, we use model replication to improve serving throughput and GPU utilization. Our findings challenge conventional assumptions about LLM inference, offering new insights and practical strategies for improving resource utilization, particularly for smaller language models.
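As a toy illustration of the replication idea, the sketch below round-robins requests across several model replicas that would share one GPU, serving them concurrently. This is not the paper's implementation: `FakeModel` is a hypothetical stand-in for a real inference engine, and real replicas would each run batched decode on the device.

```python
# Toy sketch of replica-level concurrent scheduling: requests are
# fanned out round-robin to N model replicas sharing one device.
# FakeModel is a hypothetical stand-in for a real inference engine.
import itertools
import queue
import threading

class FakeModel:
    def __init__(self, replica_id):
        self.replica_id = replica_id

    def generate(self, prompt):
        # A real replica would run batched auto-regressive decode here.
        return f"[replica {self.replica_id}] {prompt}"

def serve(replicas, requests):
    """Dispatch requests across replicas round-robin; run them concurrently."""
    results = queue.Queue()
    rr = itertools.cycle(replicas)
    threads = []
    for prompt in requests:
        model = next(rr)
        t = threading.Thread(
            target=lambda m=model, p=prompt: results.put(m.generate(p)))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    return [results.get() for _ in requests]

replicas = [FakeModel(i) for i in range(2)]
out = serve(replicas, ["hello", "world", "again"])
```

The design point this mirrors is that when decode is bandwidth-bound and the batch size is capped, additional replicas, rather than a larger batch, are what soak up the freed memory and idle compute.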