🤖 AI Summary
Tackles the difficulty of choosing a configuration when running large language models (LLMs) locally: the proposed Bench360 framework lets users define custom tasks, datasets, and metrics, automatically evaluates different models, inference engines, and quantization levels, and delivers a comprehensive assessment of both system and task performance.
📝 Abstract
Running large language models (LLMs) locally is becoming increasingly common. While the growing availability of small open-source models and inference engines has lowered the entry barrier, users now face an overwhelming number of configuration choices. Identifying an optimal configuration -- balancing functional and non-functional requirements -- requires substantial manual effort. While several benchmarks target LLM inference, they are designed for narrow evaluation goals and are not user-focused. They fail to integrate relevant system and task-specific metrics into a unified, easy-to-use benchmark that supports multiple inference engines, usage scenarios, and quantization levels. To address this gap, we present Bench360 -- Benchmarking Local LLM Inference from 360°. Bench360 allows users to easily define their own custom tasks, along with datasets and relevant task-specific metrics, and then automatically benchmarks selected LLMs, inference engines, and quantization levels across different usage scenarios (single stream, batch, and server). Bench360 tracks a wide range of metrics, including (1) system metrics -- such as Computing Performance (e.g., latency, throughput), Resource Usage (e.g., energy per query), and Deployment (e.g., cold start time) -- and (2) task-specific metrics such as ROUGE, F1 score, or accuracy. We demonstrate Bench360 on four common LLM tasks -- General Knowledge & Reasoning, QA, Summarization, and Text-to-SQL -- across three hardware platforms and four state-of-the-art inference engines. Our results reveal several interesting trade-offs between task performance and system-level efficiency, highlighting the differences between inference engines and models. Most importantly, there is no single best setup for local inference, which strongly motivates the need for a framework such as Bench360.
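The abstract does not show Bench360's actual interface, but the workflow it describes -- a user-defined task (dataset plus task-specific metrics) swept over models × inference engines × quantization levels × usage scenarios -- can be sketched. Below is a minimal, hypothetical Python sketch; all names (`CustomTask`, `exact_match`, the model/engine lists) are illustrative assumptions, not the real Bench360 API.

```python
# Hypothetical sketch of a Bench360-style benchmark definition.
# All names here are illustrative assumptions, not the actual Bench360 API.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class CustomTask:
    """A user-defined task: a dataset plus task-specific metric functions."""
    name: str
    dataset_path: str
    # metric name -> scorer(prediction, reference) returning a score
    metrics: Dict[str, Callable[[str, str], float]]

def exact_match(prediction: str, reference: str) -> float:
    """A simple task-specific metric, e.g. for Text-to-SQL accuracy."""
    return float(prediction.strip() == reference.strip())

task = CustomTask(
    name="text-to-sql",
    dataset_path="data/text_to_sql.jsonl",  # assumed dataset location
    metrics={"accuracy": exact_match},
)

# The configuration space the abstract describes:
# models x inference engines x quantization levels x usage scenarios.
models: List[str] = ["model-a", "model-b"]
engines: List[str] = ["engine-a", "engine-b"]
quantizations: List[str] = ["fp16", "int8", "int4"]
scenarios: List[str] = ["single_stream", "batch", "server"]

for model in models:
    for engine in engines:
        for quant in quantizations:
            for scenario in scenarios:
                # A real run would load the model into the engine at the given
                # quantization, replay the dataset under the scenario, and
                # record system metrics (latency, throughput, energy per query,
                # cold start time) alongside the task metrics defined above.
                print(f"run: {task.name} | {model} | {engine} | {quant} | {scenario}")
```

The nested sweep makes the abstract's point concrete: with even two models, two engines, three quantization levels, and three scenarios there are 36 candidate configurations, which is why automated, user-configurable benchmarking is preferable to manual trial and error.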