BeLLMan: Controlling LLM Congestion

📅 2025-10-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the inference-latency escalation and degraded user experience that large language models (LLMs) suffer under high load—a consequence of their autoregressive generation—this paper proposes an infrastructure-application co-design framework for dynamic output-length control. It introduces the first LLM infrastructure-level mechanism that actively senses system congestion and adjusts the application-layer generation length in real time, departing from the conventional "black-box" inference paradigm. The authors design a lightweight, adaptive signaling protocol, deploy it end-to-end on H100 GPU clusters, and jointly optimize latency and energy consumption. Experiments show up to 8× lower end-to-end latency, 25% lower energy consumption, and 19% higher request throughput under congestion, significantly improving the scalability and service stability of production LLM deployments.
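The paper does not publish its control law, but the mechanism the summary describes—infrastructure senses congestion and emits a length budget for the application—can be sketched as follows. All names (`CongestionSignal`, `sense_and_signal`, the queue-occupancy heuristic, and the linear budget shrink) are illustrative assumptions, not beLLMan's actual protocol.

```python
from dataclasses import dataclass

@dataclass
class CongestionSignal:
    level: float          # 0.0 (idle) .. 1.0 (saturated)
    max_new_tokens: int   # output-length budget for the application

def sense_and_signal(queue_depth: int, queue_capacity: int,
                     full_budget: int = 512, floor: int = 64) -> CongestionSignal:
    """Map serving-queue occupancy to a progressively smaller length budget.

    Hypothetical stand-in for an infrastructure-side congestion sensor:
    as the queue fills, the token budget shrinks linearly toward a floor,
    so in-flight requests finish sooner and latency stays bounded.
    """
    level = min(1.0, queue_depth / queue_capacity)
    budget = max(floor, int(full_budget * (1.0 - level)))
    return CongestionSignal(level=level, max_new_tokens=budget)
```

The application would then pass `max_new_tokens` to its generation call (e.g., as a decoding length cap), which is how an output-length signal can steer an otherwise black-box autoregressive loop.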

📝 Abstract
Large language model (LLM) applications are blindfolded to the infrastructure underneath and generate tokens autoregressively, indifferent to the system load, thus risking inferencing latency inflation and poor user experience. Our first-cut controller, named beLLMan, enables the LLM infrastructure to actively and progressively signal the first-party LLM application to adjust the output length in response to changing system load. On a real testbed with H100 GPUs, beLLMan helps keep inferencing latency under control (up to 8× lower end-to-end latency) and reduces energy consumption by 25% (while serving 19% more requests) during periods of congestion for a summarization workload.
Problem

Research questions and friction points this paper is trying to address.

LLM applications generate tokens autoregressively, blind to infrastructure load
Inference latency inflates under congestion, degrading user experience
Uncontrolled output lengths waste energy and throughput during high load
Innovation

Methods, ideas, or system contributions that make the work stand out.

Active, infrastructure-driven control of LLM output length
Progressive signaling protocol that adapts generation to system load
Infrastructure-application co-design validated on an H100 testbed
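The abstract emphasizes that signaling is progressive rather than abrupt. One plausible way to realize that (purely an assumption; the paper's controller is not specified here) is to smooth the raw congestion reading so the length budget changes gradually and does not oscillate with load spikes:

```python
class ProgressiveController:
    """Illustrative progressive length controller.

    Smooths raw congestion observations with an exponential moving
    average, so the output-length budget tightens and relaxes gradually.
    `alpha`, `full_budget`, and `floor` are hypothetical parameters.
    """

    def __init__(self, alpha: float = 0.3, full_budget: int = 512, floor: int = 64):
        self.alpha = alpha            # smoothing factor in (0, 1]
        self.full_budget = full_budget
        self.floor = floor
        self.level = 0.0              # smoothed congestion estimate

    def update(self, raw_level: float) -> int:
        # EMA: move the estimate a fraction of the way toward the new reading.
        self.level += self.alpha * (raw_level - self.level)
        # Shrink the budget as smoothed congestion rises, never below the floor.
        return max(self.floor, int(self.full_budget * (1.0 - self.level)))
```

Under a sustained load spike the budget ratchets down over several updates instead of collapsing at once, which matches the "actively and progressively signal" behavior the abstract describes.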