Deploying Open-Source Large Language Models: A Performance Analysis

📅 2024-09-23
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenge of efficiently deploying open-source large language models (e.g., Mistral, LLaMA) on heterogeneous GPU hardware. We introduce the first practical, multi-dimensional performance benchmark spanning model scales (7B–70B), hardware configurations (A100, H100, etc.), and the vLLM inference backend. Our methodology integrates CUDA acceleration, GPU memory consumption modeling, and joint evaluation of throughput and time-to-first-token latency to systematically quantify performance boundaries across resource configurations. The key contribution is an empirically grounded model selection and deployment recommendation table, filling a critical gap in evidence-based LLM engineering practice. Experimental results demonstrate up to 240 tokens/sec/GPU throughput for 7B models on A100 GPUs. The benchmark provides reproducible, scalable guidance for both private and public institutions deploying LLMs in production environments.
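The GPU memory modeling mentioned above can be approximated with simple arithmetic: weight memory scales with parameter count and precision, and the per-token KV-cache cost depends on the model's layer and attention-head configuration. The sketch below is a back-of-the-envelope estimator, not the paper's actual model; the Mistral-7B-style figures (32 layers, 8 KV heads, head dimension 128, fp16) are assumed values that should be checked against the model card.

```python
def model_memory_gib(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight memory in GiB (fp16/bf16 = 2 bytes per parameter)."""
    return n_params * bytes_per_param / 2**30

def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int,
                             head_dim: int, bytes_per_val: int = 2) -> int:
    """Per-token KV-cache cost: K and V are each cached in every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_val

# Assumed Mistral-7B-style configuration (verify against the model card).
weights_gib = model_memory_gib(7.2e9)               # ~13.4 GiB of weights
per_token = kv_cache_bytes_per_token(32, 8, 128)    # 131072 bytes per token
```

A 7B model in fp16 therefore needs roughly 13–15 GiB for weights alone, before adding KV cache and activation overhead, which is why such models fit on a single A100 but larger ones require multiple GPUs or quantization.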

📝 Abstract
Since the release of ChatGPT in November 2022, large language models (LLMs) have seen considerable success, including in the open-source community, with many open-weight models available. However, the requirements to deploy such a service are often unknown and difficult to evaluate in advance. To facilitate this process, we conducted numerous tests at the Centre Inria de l'Université de Bordeaux. In this article, we propose a comparison of the performance of several models of different sizes (mainly Mistral and LLaMA) depending on the available GPUs, using vLLM, a Python library designed to optimize the inference of these models. Our results provide valuable information for private and public groups wishing to deploy LLMs, allowing them to evaluate the performance of different models based on their available hardware. This study thus contributes to facilitating the adoption and use of these large language models in various application domains.
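The two metrics the paper evaluates jointly, time-to-first-token and decode throughput, can be measured with a small timing harness. The sketch below assumes a hypothetical `generate(prompt, max_tokens)` callable that streams tokens one at a time; it is not vLLM's actual API, but the same timestamps are what such a benchmark would record.

```python
import time

def benchmark(generate, prompt: str, max_tokens: int):
    """Return (time_to_first_token_s, decode_tokens_per_s) for one request.

    `generate` is any callable yielding tokens one at a time
    (hypothetical streaming interface, assumed for illustration).
    """
    start = time.perf_counter()
    ttft = None
    n_tokens = 0
    for _ in generate(prompt, max_tokens):
        if ttft is None:
            ttft = time.perf_counter() - start  # latency until first token
        n_tokens += 1
    elapsed = time.perf_counter() - start
    return ttft, n_tokens / elapsed

# Usage with a dummy generator standing in for a real model:
def dummy_generate(prompt, max_tokens):
    for i in range(max_tokens):
        yield f"tok{i}"

ttft, tps = benchmark(dummy_generate, "Hello", 16)
```

In a real deployment the throughput figure would additionally be divided by the number of GPUs serving the model, giving the tokens/sec/GPU number reported in the summary above.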
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Performance Optimization
Hardware Adaptability
Innovation

Methods, ideas, or system contributions that make the work stand out.

vLLM inference engine
Large language model performance
Hardware-specific deployment guidance
Yannis Bendi-Ouis
Centre Inria de l’Université de Bordeaux
Dan Dutartre
Centre Inria de l’Université de Bordeaux
Xavier Hinaut
Inria, Bordeaux, France
Reservoir Computing · Recurrent Neural Networks · Language Processing · Birdsong · Sensorimotor model