Speed and Conversational Large Language Models: Not All Is About Tokens per Second

📅 2024-08-01
🏛️ Computer
📈 Citations: 1
Influential: 0
🤖 AI Summary
Existing LLM inference speed benchmarks—typically reported in tokens/second—lack task awareness and fail to reflect real-world deployment latency, obscuring critical dependencies on generation length, context size, and interaction patterns. Method: We propose the first task-aware LLM inference speed evaluation framework, decoupling these factors via CUDA profiling, dynamic batching analysis, and realistic dialogue trace replay across multiple backends (Llama.cpp, vLLM, HuggingFace Transformers). Contribution/Results: Empirical evaluation reveals up to 3.2× end-to-end latency variation for the same model across dialogue vs. summarization tasks. Key bottlenecks are identified as inefficient KV cache management and imbalance between prefill and decode phases. Our framework enables quantitative performance attribution, providing actionable insights for optimizing LLM deployment efficiency on GPU hardware.
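The prefill/decode imbalance the summary refers to can be made concrete with a small timing sketch. The snippet below is a minimal illustration, not the paper's actual framework: it assumes a HuggingFace Transformers model, and the model name ("gpt2"), prompt, and token counts are placeholders chosen only so the example runs on modest hardware.

```python
# Minimal sketch (not the paper's released framework) of separating prefill
# latency (time to first token) from decode throughput. Model name, prompt,
# and token counts are illustrative placeholders.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper benchmarks larger open-weights LLMs
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

prompt = "Summarize the following meeting notes: " + "the team discussed latency. " * 50
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    # Prefill: process the whole prompt and emit exactly one new token.
    t0 = time.perf_counter()
    model.generate(**inputs, max_new_tokens=1, do_sample=False)
    prefill_s = time.perf_counter() - t0

    # Decode: generate a longer continuation and count the tokens actually produced.
    t0 = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    total_s = time.perf_counter() - t0

generated = out.shape[1] - inputs["input_ids"].shape[1]
decode_s_per_token = (total_s - prefill_s) / max(generated - 1, 1)
print(f"prefill (time to first token): {prefill_s:.3f} s for {inputs['input_ids'].shape[1]} prompt tokens")
print(f"decode: {decode_s_per_token * 1000:.1f} ms/token "
      f"(~{1.0 / decode_s_per_token:.1f} tokens/s over {generated} tokens)")
```

Reporting only the aggregate tokens/second would hide the prefill cost, which grows with context size and weighs more heavily on summarization-style prompts than on short conversational turns.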

📝 Abstract
The speed of open-weights large language models (LLMs) running on GPUs, and its dependency on the task at hand, are studied to provide a comparative analysis of the speed of the most popular open LLMs.
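As a back-of-the-envelope illustration of the title's point that tokens per second alone does not determine perceived speed, the sketch below uses hypothetical prefill and decode rates (not figures from the paper) to show how prompt and output lengths drive end-to-end latency for different tasks.

```python
# Back-of-the-envelope model (illustrative numbers, not results from the paper):
# at identical decode rates, prompt and output lengths alone can change
# end-to-end latency by an order of magnitude.
def end_to_end_latency(prompt_tokens, output_tokens, prefill_rate, decode_rate):
    """Latency ~ prefill time + decode time (both in seconds)."""
    return prompt_tokens / prefill_rate + output_tokens / decode_rate

PREFILL_RATE = 2000.0  # hypothetical prompt tokens processed per second
DECODE_RATE = 40.0     # hypothetical output tokens generated per second

chat = end_to_end_latency(prompt_tokens=200, output_tokens=50,
                          prefill_rate=PREFILL_RATE, decode_rate=DECODE_RATE)
summary = end_to_end_latency(prompt_tokens=4000, output_tokens=400,
                             prefill_rate=PREFILL_RATE, decode_rate=DECODE_RATE)

print(f"short chat turn:        {chat:.2f} s")     # about 1.35 s
print(f"long-document summary:  {summary:.2f} s")  # 12.00 s
```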
Problem

Research questions and friction points this paper is trying to address.

Analyze the speed of open-weights LLMs
Characterize its dependency on tasks and GPUs
Compare the speed of the most popular open LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-weights LLMs speed analysis
GPU performance dependency study
Comparative speed evaluation methodology
Javier Conde
ETSI de Telecomunicación, Universidad Politécnica de Madrid, 28040 Madrid, Spain
Miguel González
ETSI de Telecomunicación, Universidad Politécnica de Madrid, 28040 Madrid, Spain
Pedro Reviriego
ETSI de Telecomunicación, Universidad Politécnica de Madrid, 28040 Madrid, Spain
Zhen Gao
Beijing Institute of Technology
Generative AI, 6G, MIMO communications, IoT edge computing, Large Model
Shanshan Liu
University of Electronic Science and Technology of China, 611731 Chengdu, China
Fabrizio Lombardi
International Test Conference (ITC) Endowed Chair Professor, Northeastern University
Computer Arithmetic, Digital Circuits, Memory Design, Approximate Computing, Nanocomputing