🤖 AI Summary
To address critical challenges in deploying generative AI (GenAI) applications on resource-constrained end-user devices—low system efficiency, high response latency, unfair resource scheduling, and poor performance under static service configurations—this paper proposes ConsumerBench, a comprehensive benchmarking framework tailored to edge-centric GenAI systems. The framework simulates realistic scenarios in which multiple applications run concurrently on constrained hardware and supports customizable, multi-application collaborative workflows. It captures both application-level quality of service (latency and Service Level Objective, or SLO, attainment) and system-level resource utilization (CPU/GPU utilization and memory bandwidth). The evaluation surfaces concrete inefficiencies and motivates two optimizations: greedy scheduling incurs over 30% SLO violations; a custom GPU kernel tailored to consumer-grade architectures improves throughput by 2.1×; and an SLO-aware scheduler raises SLO compliance from 68% to 94%. Collectively, this work establishes a methodology for evaluating and optimizing GenAI systems in real-world edge environments.
📝 Abstract
The recent shift in Generative AI (GenAI) applications from cloud-only environments to end-user devices introduces new challenges in resource management, system efficiency, and user experience. This paper presents ConsumerBench, a comprehensive benchmarking framework designed to evaluate the system efficiency and response time of GenAI models running on end-user devices. Unlike existing benchmarks that assume exclusive model access on dedicated GPUs, ConsumerBench simulates realistic multi-application scenarios executing concurrently on constrained hardware. Furthermore, ConsumerBench supports customizable workflows that simulate complex tasks requiring coordination among multiple applications. ConsumerBench captures both application-level metrics, including latency and Service Level Objective (SLO) attainment, and system-level metrics like CPU/GPU utilization and memory bandwidth. Through extensive experiments, ConsumerBench reveals inefficiencies in resource sharing, unfair scheduling under greedy allocation, and performance pitfalls of static model server configurations. The paper also provides practical insights for model developers and system designers, highlighting the benefits of custom kernels tailored to consumer-grade GPU architectures and the value of implementing SLO-aware scheduling strategies.
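The SLO attainment metric mentioned above is conventionally the fraction of requests whose observed latency meets the application's latency target. A minimal sketch of that computation (the `Request` fields and function name are illustrative, not ConsumerBench's actual API):

```python
from dataclasses import dataclass

@dataclass
class Request:
    latency_ms: float  # observed end-to-end latency
    slo_ms: float      # per-application latency target (the SLO)

def slo_attainment(requests: list[Request]) -> float:
    """Fraction of requests whose latency met their SLO target."""
    if not requests:
        return 0.0
    met = sum(1 for r in requests if r.latency_ms <= r.slo_ms)
    return met / len(requests)

# Example: three requests against a 500 ms target, one violation.
reqs = [Request(420, 500), Request(610, 500), Request(480, 500)]
print(f"SLO attainment: {slo_attainment(reqs):.0%}")  # 2 of 3 requests met the SLO
```

Under contention, per-request latencies inflate even when aggregate throughput looks healthy, which is why a benchmark that tracks attainment per application can expose unfairness that system-wide averages hide.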