🤖 AI Summary
This work addresses the cost of evaluating large language model (LLM) inference configurations: existing profile-based simulators re-profile every operation from scratch for each combination of hardware, serving engine, attention backend, and model. The authors propose Dooly, a configuration-agnostic, redundancy-aware profiling mechanism that uses taint propagation to trace whether each operation's input dimensions originate from the model configuration or from the incoming request, so that a single inference pass yields profiling data shareable across configurations. Dooly profiles only operations missing from its latency database and fits latency regression models to that database for use in existing simulators. Evaluated across two GPU platforms, three attention backends, and multiple model architectures, it keeps simulation error within 5% MAPE for Time-To-First-Token (TTFT) and 8% for Time-Per-Output-Token (TPOT), while reducing profiling GPU-hours by 56.4% across 12 models compared to the existing profiling approach.
📝 Abstract
Selecting the optimal LLM inference configuration requires evaluation across hardware, serving engines, attention backends, and model architectures, since no single choice performs best across all workloads. Profile-based simulators are the standard tool, yet they hardcode their operation set to a specific configuration and re-profile every operation from scratch, making exploration prohibitively expensive. This cost stems from a missing structural understanding: every input dimension of each operation is fixed by the model configuration or determined by the incoming request. Many model-configuration values (e.g., head size, layer count) recur across models, so the same operation runs in many configurations; a single sweep over the request-dependent dimensions can serve them all. We present Dooly, which exploits this structure to achieve configuration-agnostic, redundancy-aware profiling. Dooly performs a single inference pass, labels each input dimension with its origin via taint propagation, and selectively profiles only operations absent from its latency database; stateful operations such as attention are isolated by reusing the serving engine's own initialization code, eliminating manual instrumentation. It builds latency regression models based on the database, which becomes a drop-in backend for existing simulators. Across two GPU platforms, three attention backends, and diverse model architectures, Dooly achieves simulation accuracy within 5% MAPE for TTFT and 8% for TPOT while reducing profiling GPU-hours by 56.4% across 12 models compared to the existing profiling approach.
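The origin-labeling idea can be illustrated with a small sketch. This is not Dooly's implementation: the names `Dim`, `signature`, `LatencyDB`, `fake_profile`, and `trace_qkv_proj` are hypothetical, taint is tracked on plain Python integers rather than on real tensors, and the "profile" step is a placeholder instead of a GPU sweep. It only shows how tagging each dimension as configuration-fixed or request-dependent lets identical operations from different runs map to one database entry.

```python
from dataclasses import dataclass

CONFIG, REQUEST = "config", "request"


@dataclass(frozen=True)
class Dim:
    """An integer dimension tagged with the origins it was derived from."""
    value: int
    origins: frozenset

    def _combine(self, other, fn):
        if not isinstance(other, Dim):
            other = Dim(other, frozenset())  # plain constants carry no taint
        # Taint propagation: the result inherits the origins of both operands.
        return Dim(fn(self.value, other.value), self.origins | other.origins)

    def __mul__(self, other):
        return self._combine(other, lambda a, b: a * b)

    def __floordiv__(self, other):
        return self._combine(other, lambda a, b: a // b)


def config_dim(v):
    return Dim(v, frozenset({CONFIG}))


def request_dim(v):
    return Dim(v, frozenset({REQUEST}))


def signature(op_name, shape):
    """Key an operation by its config-fixed dims; request dims become '*'."""
    key = tuple("*" if REQUEST in d.origins else d.value for d in shape)
    return (op_name, key)


class LatencyDB:
    """Hypothetical latency store that profiles a signature at most once."""

    def __init__(self):
        self.entries = {}
        self.profiled = 0

    def ensure(self, sig, profile_fn):
        if sig not in self.entries:  # selective profiling: skip known ops
            self.entries[sig] = profile_fn(sig)
            self.profiled += 1
        return self.entries[sig]


def fake_profile(sig):
    # Placeholder for a real sweep over the request-dependent dimensions.
    return {"swept_axes": [i for i, d in enumerate(sig[1]) if d == "*"]}


def trace_qkv_proj(hidden, heads, num_tokens):
    """Illustrative QKV projection: input [tokens, hidden], output width 3*hidden."""
    hidden_d, heads_d = config_dim(hidden), config_dim(heads)
    tokens_d = request_dim(num_tokens)
    head_size = hidden_d // heads_d
    return signature("qkv_proj", (tokens_d, hidden_d, heads_d * head_size * 3))


db = LatencyDB()
sig_a = trace_qkv_proj(hidden=4096, heads=32, num_tokens=17)
sig_b = trace_qkv_proj(hidden=4096, heads=32, num_tokens=512)  # same op, new request
sig_c = trace_qkv_proj(hidden=5120, heads=40, num_tokens=17)   # different config
for sig in (sig_a, sig_b, sig_c):
    db.ensure(sig, fake_profile)
```

In this sketch the first two traces differ only in a request-dependent dimension, so they share one database entry and only two profiling calls are made for three traces. Any model whose configuration yields the same fixed dimensions would reuse that entry as well.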