🤖 AI Summary
To address the fundamental latency–accuracy–throughput trade-off in LLM inference, this paper proposes a resource-adaptive dynamic scheduling framework. First, it enables input-aware model selection via real-time online evaluation of candidate models on prompt subsets. Second, it introduces a jointly optimized hierarchical progressive loading and early-exit mechanism that dynamically loads only the minimal required number of layers, breaking the constraint of static full-model loading. Third, it integrates performance-monitoring-driven model reselection with a unified execution framework for Early-Exit LLMs. Compared to baselines, the approach achieves 1.48× higher throughput, 1.39× lower latency, 1.10× better energy efficiency, and a 3.7× larger maximum batch size. To the best of our knowledge, this is the first work to enable runtime model switching coordinated with adaptive early-exit depth.
📝 Abstract
Deploying large language models (LLMs) presents critical challenges due to the inherent trade-offs associated with key performance metrics, such as latency, accuracy, and throughput. Typically, gains in one metric are accompanied by degradation in others. Early-Exit LLMs (EE-LLMs) efficiently navigate this trade-off space by skipping some of the later model layers when they confidently predict an output token early, thus reducing latency without impacting accuracy. However, as the early exits taken depend on the task and are unknown a priori to request processing, EE-LLMs conservatively load the entire model, limiting resource savings and throughput. Moreover, current frameworks statically select a model for a user task, limiting the ability to adapt to the changing nature of input queries. We propose HELIOS to address these challenges. First, HELIOS shortlists a set of candidate LLMs and evaluates them using a subset of prompts, gathering telemetry data in real time. Second, HELIOS uses the early-exit data from these evaluations to greedily load the selected model only up to a limited number of layers. This yields memory savings, which enables us to process more requests concurrently, thereby improving throughput. Third, HELIOS monitors and periodically reassesses the performance of the candidate LLMs and, if needed, switches to another model that can service incoming queries more efficiently (such as using fewer layers without lowering accuracy). Our evaluations show that HELIOS achieves 1.48$\times$ higher throughput, 1.10$\times$ better energy efficiency, 1.39$\times$ lower response time, and 3.7$\times$ larger inference batch sizes compared to the baseline, when optimizing for the respective service level objectives.
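The layer-budget decision described above can be pictured with a small sketch. The function below is illustrative only: the function name, the percentile-based depth heuristic, the accuracy threshold, and the example telemetry are assumptions made for exposition, not details taken from the paper; it simply shows how early-exit telemetry gathered on a calibration prompt subset could drive both model selection and the number of layers to load.

```python
import numpy as np

def choose_model_and_depth(exit_depths, accuracy, total_layers,
                           coverage=0.95, min_accuracy=0.90):
    """Pick a candidate model and how many of its layers to load.

    exit_depths:  dict model -> list of early-exit layer indices observed on
                  the calibration prompt subset.
    accuracy:     dict model -> accuracy measured on the same subset.
    total_layers: dict model -> full layer count of the model.
    coverage:     fraction of calibration requests that should exit within the
                  loaded prefix of layers.
    """
    best = None  # (model, layers_to_load)
    for model, depths in exit_depths.items():
        if accuracy.get(model, 0.0) < min_accuracy:
            continue  # skip models that miss the accuracy target on the subset
        # Load only enough layers to cover `coverage` of the observed exits.
        layers_needed = int(np.ceil(np.percentile(depths, coverage * 100)))
        layers_needed = min(layers_needed, total_layers[model])
        # Prefer the candidate that services requests with the fewest layers.
        if best is None or layers_needed < best[1]:
            best = (model, layers_needed)
    return best

# Example with two hypothetical candidates and their observed exit depths.
telemetry = {"model_a": [10, 12, 14, 16], "model_b": [20, 22, 24, 26]}
acc = {"model_a": 0.93, "model_b": 0.95}
layers = {"model_a": 32, "model_b": 40}
print(choose_model_and_depth(telemetry, acc, layers))  # -> ('model_a', 16)
```

In this sketch, requests that would exit deeper than the loaded prefix are simply not covered; how such cases are handled (and when the monitoring loop triggers a switch to another candidate) is part of the full framework and is not modeled here.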