Scaling On-Device GPU Inference for Large Generative Models

πŸ“… 2025-05-01
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Efficient and privacy-preserving inference of large generative models on resource-constrained end devices (e.g., mobile phones and laptops) remains challenging due to hardware heterogeneity, memory bottlenecks, and computational limitations. Method: This paper introduces ML Drift, a cross-platform GPU inference framework featuring a unified GPU API abstraction layer compatible with NVIDIA, AMD, Intel, and mainstream mobile GPUs; a lightweight kernel fusion and memory-aware dynamic tensor reuse mechanism; and support for FP16/INT4 quantization, dynamic tensor sharding, and cross-vendor driver adaptation. Contribution/Results: The framework enables real-time inference (>20 tokens/s) of billion-parameter generative models directly on end-device GPUs, scaling model capacity 10–100× beyond prior on-device solutions. On Snapdragon 8 Gen 3 and RTX 4060 Laptop GPUs, it achieves 90% end-to-end latency reduction and a 10× throughput improvement, effectively overcoming end-device compute and memory constraints.
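
The INT4 quantization mentioned above is central to fitting billion-parameter weights into end-device memory: packing two 4-bit weights per byte cuts storage 4x relative to FP16 (8x relative to FP32). The paper does not spell out its quantization scheme at this level of detail, so the following is a minimal C++ sketch of symmetric per-tensor INT4 packing; the names quantize_int4/dequantize_int4 and the single per-tensor scale are illustrative assumptions, not ML Drift's API.

```cpp
// Illustrative sketch of symmetric INT4 weight quantization; ML Drift's
// actual scheme is not described at this level of detail in the paper.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

// Quantize FP32 weights to packed INT4: two signed 4-bit values per byte.
// 'scale' maps the integer range [-8, 7] back to floats.
std::vector<uint8_t> quantize_int4(const std::vector<float>& w, float& scale) {
    float max_abs = 0.0f;
    for (float v : w) max_abs = std::max(max_abs, std::fabs(v));
    scale = (max_abs > 0.0f) ? max_abs / 7.0f : 1.0f;  // 7 = max positive 4-bit value
    std::vector<uint8_t> packed((w.size() + 1) / 2, 0);
    for (size_t i = 0; i < w.size(); ++i) {
        int q = std::clamp(static_cast<int>(std::lround(w[i] / scale)), -8, 7);
        uint8_t nibble = static_cast<uint8_t>(q) & 0xF;  // two's-complement nibble
        packed[i / 2] |= (i % 2 == 0) ? nibble : static_cast<uint8_t>(nibble << 4);
    }
    return packed;
}

// Recover one weight from the packed buffer by sign-extending its nibble.
float dequantize_int4(const std::vector<uint8_t>& packed, size_t i, float scale) {
    uint8_t nibble = (i % 2 == 0) ? (packed[i / 2] & 0xF) : (packed[i / 2] >> 4);
    int q = (nibble & 0x8) ? static_cast<int>(nibble) - 16 : static_cast<int>(nibble);
    return static_cast<float>(q) * scale;
}

int main() {
    std::vector<float> w = {0.12f, -0.53f, 0.98f, -1.0f, 0.0f};
    float scale = 0.0f;
    auto packed = quantize_int4(w, scale);
    for (size_t i = 0; i < w.size(); ++i)
        std::printf("%+.3f -> %+.3f\n", w[i], dequantize_int4(packed, i, scale));
}
```

Production engines typically use per-channel or per-group scales and dequantize inside the GPU kernel rather than on the host; the sketch only shows the storage format.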

πŸ“ Abstract
Driven by the advancements in generative AI, large machine learning models have revolutionized domains such as image processing, audio synthesis, and speech recognition. While server-based deployments remain the locus of peak performance, the imperative for on-device inference, necessitated by privacy and efficiency considerations, persists. Recognizing GPUs as the on-device ML accelerator with the widest reach, we present ML Drift, an optimized framework that extends the capabilities of state-of-the-art GPU-accelerated inference engines. ML Drift enables on-device execution of generative AI workloads which contain 10 to 100x more parameters than existing on-device generative AI models. ML Drift addresses intricate engineering challenges associated with cross-GPU API development, and ensures broad compatibility across mobile and desktop/laptop platforms, thereby facilitating the deployment of significantly more complex models on resource-constrained devices. Our GPU-accelerated ML/AI inference engine achieves an order-of-magnitude performance improvement relative to existing open-source GPU inference engines.
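
To make the cross-GPU API challenge concrete, the sketch below shows one common way to structure a unified abstraction layer in C++: a single backend interface that OpenCL, Metal, Vulkan, or WebGPU implementations hide behind, selected at runtime, so kernels and scheduling code are written once. Everything here (GpuBackend, CpuReferenceBackend, make_backend) is hypothetical scaffolding for illustration; ML Drift's actual interfaces are not published in this abstract.

```cpp
// Hypothetical sketch of a unified GPU API abstraction layer: one backend
// interface, with a concrete implementation per vendor API chosen at runtime.
#include <cstdio>
#include <cstring>
#include <memory>
#include <string>
#include <vector>

// Opaque handle to device memory, owned by whichever backend allocated it.
struct GpuBuffer { void* handle = nullptr; size_t bytes = 0; };

// The single interface the inference engine codes against. Real backends
// (OpenCL, Metal, Vulkan, WebGPU) would live behind this boundary.
class GpuBackend {
public:
    virtual ~GpuBackend() = default;
    virtual std::string name() const = 0;
    virtual GpuBuffer allocate(size_t bytes) = 0;
    virtual void upload(GpuBuffer& dst, const void* src, size_t bytes) = 0;
    virtual void dispatch(const std::string& kernel, GpuBuffer& io) = 0;
};

// CPU stand-in so the sketch runs anywhere; a real port would wrap
// clEnqueue* calls, MTLComputeCommandEncoder, vkCmdDispatch, etc.
class CpuReferenceBackend : public GpuBackend {
public:
    std::string name() const override { return "cpu-reference"; }
    GpuBuffer allocate(size_t bytes) override {
        storage_.emplace_back(bytes);
        return {storage_.back().data(), bytes};
    }
    void upload(GpuBuffer& dst, const void* src, size_t bytes) override {
        std::memcpy(dst.handle, src, bytes);
    }
    void dispatch(const std::string& kernel, GpuBuffer&) override {
        std::printf("dispatch '%s' on %s\n", kernel.c_str(), name().c_str());
    }
private:
    std::vector<std::vector<unsigned char>> storage_;
};

// Runtime selection keeps model code identical across platforms; real logic
// would probe available drivers (OpenCL ICDs, Metal, Vulkan loaders).
std::unique_ptr<GpuBackend> make_backend() {
    return std::make_unique<CpuReferenceBackend>();
}

int main() {
    auto gpu = make_backend();
    float weights[4] = {1, 2, 3, 4};
    GpuBuffer buf = gpu->allocate(sizeof(weights));
    gpu->upload(buf, weights, sizeof(weights));
    gpu->dispatch("matmul_fp16", buf);
}
```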
Problem

Research questions and friction points this paper is trying to address.

Enabling on-device GPU inference for large generative models
Overcoming cross-GPU API development challenges to ensure broad compatibility
Achieving order-of-magnitude performance improvements on resource-constrained devices
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimized framework (ML Drift) extending state-of-the-art GPU-accelerated inference engines, with kernel fusion and memory-aware tensor reuse (see the sketch after this list)
Enables on-device execution of generative AI workloads with 10 to 100x more parameters than existing on-device models
Ensures broad cross-platform GPU compatibility across mobile and desktop/laptop platforms
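
The memory-aware dynamic tensor reuse mechanism named in the summary is, in spirit, an allocation-recycling problem: transformer inference creates and discards many same-shaped intermediate tensors, so returning their buffers to a size-keyed pool bounds peak memory regardless of layer count. The BufferPool below is a hypothetical minimal sketch of that idea, not the paper's mechanism.

```cpp
// Hypothetical sketch of memory-aware tensor reuse: intermediate tensors
// return their device buffers to a size-keyed pool instead of freeing them,
// so per-layer activations recycle a small set of allocations.
#include <cstdio>
#include <map>

struct Buffer { size_t bytes; int id; };

class BufferPool {
public:
    // Reuse a free buffer of at least 'bytes' if one exists; else allocate.
    Buffer acquire(size_t bytes) {
        auto it = free_.lower_bound(bytes);
        if (it != free_.end()) {
            Buffer b = it->second;
            free_.erase(it);
            return b;
        }
        total_allocated_ += bytes;
        return Buffer{bytes, next_id_++};
    }
    // Return a buffer for later reuse instead of releasing device memory.
    void release(Buffer b) { free_.emplace(b.bytes, b); }
    size_t total_allocated() const { return total_allocated_; }
private:
    std::multimap<size_t, Buffer> free_;  // free buffers keyed by capacity
    size_t total_allocated_ = 0;
    int next_id_ = 0;
};

int main() {
    BufferPool pool;
    // Two activation tensors per layer across 32 layers reuse two buffers.
    for (int layer = 0; layer < 32; ++layer) {
        Buffer a = pool.acquire(4 << 20);  // 4 MiB activation
        Buffer b = pool.acquire(4 << 20);
        pool.release(a);
        pool.release(b);
    }
    std::printf("allocated %zu bytes for 64 tensors\n", pool.total_allocated());
}
```

In the toy run, 64 logical tensors are served by just two 4 MiB allocations; a production engine would additionally account for tensor lifetimes, alignment, and in-place fusion opportunities, which the sketch omits.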
Authors

Jiuqiang Tang (Google LLC)
Raman Sarokin (Google LLC)
Ekaterina Ignasheva (Meta Platforms, Inc.)
Grant Jensen (Google LLC)
Lin Chen (Google LLC)
Juhyun Lee (University of Texas at Arlington; Cardiac Development, Biomechanics, Optical Imaging)
Andrei Kulik (Google LLC)
Matthias Grundmann (Google Research; Computer Vision, Machine Learning, Computational Video)