🤖 AI Summary
Existing LLM inference speed benchmarks—typically reported in tokens/second—lack task awareness and fail to reflect real-world deployment latency, obscuring critical dependencies on generation length, context size, and interaction patterns. Method: We propose the first task-aware LLM inference speed evaluation framework, decoupling these factors via CUDA profiling, dynamic batching analysis, and realistic dialogue trace replay across multiple backends (Llama.cpp, vLLM, HuggingFace Transformers). Contribution/Results: Empirical evaluation reveals up to 3.2× end-to-end latency variation for the same model across dialogue vs. summarization tasks. Key bottlenecks are identified as inefficient KV cache management and imbalance between prefill and decode phases. Our framework enables quantitative performance attribution, providing actionable insights for optimizing LLM deployment efficiency on GPU hardware.
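The prefill/decode imbalance mentioned above comes down to splitting one end-to-end latency number into two phase-level metrics: time-to-first-token (dominated by prefill over the context) and decode throughput (tokens/second during generation). The helper below is a minimal illustrative sketch of that decomposition, not the framework's actual API; the function name and inputs are assumptions for the example.

```python
# Illustrative sketch (not the paper's API): decouple prefill latency
# (time-to-first-token, TTFT) from decode throughput, given the request
# start time and the wall-clock timestamp of each emitted token.

def split_prefill_decode(request_start, token_timestamps):
    """Return (time_to_first_token, decode_tokens_per_second).

    token_timestamps: emission times of generated tokens, in order.
    The first token marks the end of the prefill phase; the remaining
    tokens define the decode phase.
    """
    ttft = token_timestamps[0] - request_start
    n_decode_tokens = len(token_timestamps) - 1
    decode_span = token_timestamps[-1] - token_timestamps[0]
    tps = n_decode_tokens / decode_span if decode_span > 0 else float("inf")
    return ttft, tps

# Example: 5 tokens; the first arrives 0.8 s after the request
# (prefill-heavy, e.g. a long summarization context), then one token
# every 50 ms during decode.
ttft, tps = split_prefill_decode(0.0, [0.80, 0.85, 0.90, 0.95, 1.00])
```

Reporting TTFT and decode tokens/second separately is what makes a benchmark task-aware: a summarization request (long prefill, short generation) and a dialogue turn (short prefill, long generation) can show the same aggregate tokens/second while having very different user-visible latency.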
📝 Abstract
We study the inference speed of open-weights large language models (LLMs) on GPUs and its dependence on the task at hand, presenting a comparative analysis of the most popular open LLMs.