Speed and Conversational Large Language Models: Not All Is About Tokens per Second

📅 2024-08-01
🏛️ Computer
📈 Citations: 1
Influential: 0

🤖 AI Summary
Problem: Existing LLM inference speed benchmarks, typically reported in tokens/second, are not task-aware and do not reflect real-world deployment latency; they hide how speed depends on generation length, context size, and interaction pattern. Method: We propose the first task-aware LLM inference speed evaluation framework, which separates these factors through CUDA profiling, dynamic batching analysis, and replay of realistic dialogue traces across multiple backends (Llama.cpp, vLLM, HuggingFace Transformers). Contribution/Results: Empirical evaluation reveals up to 3.2× variation in end-to-end latency for the same model between dialogue and summarization tasks. The key bottlenecks identified are inefficient KV cache management and an imbalance between the prefill and decode phases. The framework enables quantitative performance attribution and gives actionable guidance for optimizing LLM deployment efficiency on GPU hardware.
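To make the point about tokens/second concrete, here is a minimal sketch of a two-phase latency model, assuming latency is simply prompt tokens divided by a prefill rate plus output tokens divided by a decode rate. The `Task` class, the function names, and every number below are hypothetical illustrations, not values or code from the paper, and the model deliberately ignores batching and KV cache effects.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """A workload described only by its prompt and output lengths (hypothetical)."""
    name: str
    prompt_tokens: int
    output_tokens: int


def end_to_end_latency(task: Task, prefill_tps: float, decode_tps: float) -> float:
    """Seconds to finish a request under a simple two-phase model.

    Assumption: prefill and decode each run at a constant rate.
    """
    prefill_time = task.prompt_tokens / prefill_tps
    decode_time = task.output_tokens / decode_tps
    return prefill_time + decode_time


def effective_output_tps(task: Task, prefill_tps: float, decode_tps: float) -> float:
    """Output tokens per second as the user experiences it, prefill included."""
    return task.output_tokens / end_to_end_latency(task, prefill_tps, decode_tps)


# Hypothetical rates for one model on one GPU (not measured values).
PREFILL_TPS = 2000.0
DECODE_TPS = 50.0

# Same output length, very different context size.
dialogue = Task("dialogue", prompt_tokens=200, output_tokens=100)
summarization = Task("summarization", prompt_tokens=8000, output_tokens=100)

for task in (dialogue, summarization):
    latency = end_to_end_latency(task, PREFILL_TPS, DECODE_TPS)
    tps = effective_output_tps(task, PREFILL_TPS, DECODE_TPS)
    print(f"{task.name}: {latency:.2f} s end to end, {tps:.1f} output tokens/s")
```

Under these assumed rates the decode speed is identical for both tasks, yet the summarization request takes noticeably longer because of its larger prompt. A single tokens/second figure would not show that difference.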

📝 Abstract
This work studies the speed of open-weights large language models (LLMs) running on GPUs and how that speed depends on the task at hand, and presents a comparative analysis of the speed of the most popular open LLMs.
Problem

Research questions and friction points this paper is trying to address.

Analyze the speed of open-weights LLMs
Characterize how that speed depends on the task when running on GPUs
Compare the speed of popular open LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Speed analysis of open-weights LLMs
GPU performance dependency study
Comparative speed evaluation methodology