Deploying Open-Source Large Language Models: A Performance Analysis

📅 2024-09-23
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenge of efficiently deploying open-source large language models (e.g., Mistral, LLaMA) on heterogeneous GPU hardware. We introduce the first practical, multi-dimensional performance benchmark spanning model scales (7B–70B), hardware configurations (A100, H100, etc.), and the vLLM inference backend. Our methodology integrates CUDA acceleration, GPU memory consumption modeling, and joint evaluation of throughput and time-to-first-token latency to systematically quantify performance boundaries across resource configurations. The key contribution is an empirically grounded model selection and deployment recommendation table, filling a critical gap in evidence-based LLM engineering practice. Experimental results demonstrate up to 240 tokens/sec/GPU throughput for 7B models on A100 GPUs. The benchmark provides reproducible, scalable guidance for both private and public institutions deploying LLMs in production environments.
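The GPU memory modeling mentioned above can be approximated with simple arithmetic: weight memory scales with parameter count and precision, and the per-token KV-cache cost depends on the model's layer and attention-head configuration. The sketch below is a back-of-the-envelope estimator, not the paper's actual model; the Mistral-7B-style figures (32 layers, 8 KV heads, head dimension 128, fp16) are assumed values that should be checked against the model card.

```python
def model_memory_gib(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight memory in GiB (fp16/bf16 = 2 bytes per parameter)."""
    return n_params * bytes_per_param / 2**30

def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int,
                             head_dim: int, bytes_per_val: int = 2) -> int:
    """Per-token KV-cache cost: K and V are each cached in every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_val

# Assumed Mistral-7B-style configuration (verify against the model card).
weights_gib = model_memory_gib(7.2e9)               # ~13.4 GiB of weights
per_token = kv_cache_bytes_per_token(32, 8, 128)    # 131072 bytes per token
```

A 7B model in fp16 therefore needs roughly 13–15 GiB for weights alone, before adding KV cache and activation overhead, which is why such models fit on a single A100 but larger ones require multiple GPUs or quantization.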

📝 Abstract
Since the release of ChatGPT in November 2022, large language models (LLMs) have seen considerable success, including in the open-source community, with many open-weight models available. However, the requirements to deploy such a service are often unknown and difficult to evaluate in advance. To facilitate this process, we conducted numerous tests at the Centre Inria de l'Université de Bordeaux. In this article, we propose a comparison of the performance of several models of different sizes (mainly Mistral and LLaMA) depending on the available GPUs, using vLLM, a Python library designed to optimize the inference of these models. Our results provide valuable information for private and public groups wishing to deploy LLMs, allowing them to evaluate the performance of different models based on their available hardware. This study thus contributes to facilitating the adoption and use of these large language models in various application domains.
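The two metrics the paper evaluates jointly, time-to-first-token and decode throughput, can be measured with a small timing harness. The sketch below assumes a hypothetical `generate(prompt, max_tokens)` callable that streams tokens one at a time; it is not vLLM's actual API, but the same timestamps are what such a benchmark would record.

```python
import time

def benchmark(generate, prompt: str, max_tokens: int):
    """Return (time_to_first_token_s, decode_tokens_per_s) for one request.

    `generate` is any callable yielding tokens one at a time
    (hypothetical streaming interface, assumed for illustration).
    """
    start = time.perf_counter()
    ttft = None
    n_tokens = 0
    for _ in generate(prompt, max_tokens):
        if ttft is None:
            ttft = time.perf_counter() - start  # latency until first token
        n_tokens += 1
    elapsed = time.perf_counter() - start
    return ttft, n_tokens / elapsed

# Usage with a dummy generator standing in for a real model:
def dummy_generate(prompt, max_tokens):
    for i in range(max_tokens):
        yield f"tok{i}"

ttft, tps = benchmark(dummy_generate, "Hello", 16)
```

In a real deployment the throughput figure would additionally be divided by the number of GPUs serving the model, giving the tokens/sec/GPU number reported in the summary above.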
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Performance Optimization
Hardware Adaptability
Innovation

Methods, ideas, or system contributions that make the work stand out.

vLLM inference engine
Large language model performance
Hardware-specific deployment guidance
Yannis Bendi-Ouis
Centre Inria de l’Université de Bordeaux
Dan Dutartre
Centre Inria de l’Université de Bordeaux
Xavier Hinaut
Inria, Bordeaux, France
Reservoir Computing · Recurrent Neural Networks · Language Processing · Birdsong · Sensorimotor model