🤖 AI Summary
Existing LLM inference speed benchmarks—typically reported in tokens/second—lack task awareness and fail to reflect real-world deployment latency, obscuring critical dependencies on generation length, context size, and interaction patterns. Method: We propose the first task-aware LLM inference speed evaluation framework, decoupling these factors via CUDA profiling, dynamic batching analysis, and realistic dialogue trace replay across multiple backends (Llama.cpp, vLLM, HuggingFace Transformers). Contribution/Results: Empirical evaluation reveals up to 3.2× end-to-end latency variation for the same model across dialogue vs. summarization tasks. Key bottlenecks are identified as inefficient KV cache management and imbalance between prefill and decode phases. Our framework enables quantitative performance attribution, providing actionable insights for optimizing LLM deployment efficiency on GPU hardware.
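The prefill/decode imbalance mentioned above comes down to splitting one end-to-end latency number into two phase-level metrics: time-to-first-token (dominated by prefill over the context) and decode throughput (tokens/second during generation). The helper below is a minimal illustrative sketch of that decomposition, not the framework's actual API; the function name and inputs are assumptions for the example.

```python
# Illustrative sketch (not the paper's API): decouple prefill latency
# (time-to-first-token, TTFT) from decode throughput, given the request
# start time and the wall-clock timestamp of each emitted token.

def split_prefill_decode(request_start, token_timestamps):
    """Return (time_to_first_token, decode_tokens_per_second).

    token_timestamps: emission times of generated tokens, in order.
    The first token marks the end of the prefill phase; the remaining
    tokens define the decode phase.
    """
    ttft = token_timestamps[0] - request_start
    n_decode_tokens = len(token_timestamps) - 1
    decode_span = token_timestamps[-1] - token_timestamps[0]
    tps = n_decode_tokens / decode_span if decode_span > 0 else float("inf")
    return ttft, tps

# Example: 5 tokens; the first arrives 0.8 s after the request
# (prefill-heavy, e.g. a long summarization context), then one token
# every 50 ms during decode.
ttft, tps = split_prefill_decode(0.0, [0.80, 0.85, 0.90, 0.95, 1.00])
```

Reporting TTFT and decode tokens/second separately is what makes a benchmark task-aware: a summarization request (long prefill, short generation) and a dialogue turn (short prefill, long generation) can show the same aggregate tokens/second while having very different user-visible latency.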
📝 Abstract
We study the inference speed of open-weights large language models (LLMs) on GPUs and its dependence on the task at hand, presenting a comparative analysis of the most popular open LLMs.