🤖 AI Summary
To address the dual challenges of resource constraints (memory, power, compute) and stringent real-time requirements on edge devices, this paper proposes an algorithm-hardware co-design framework for efficient large language model (LLM) deployment. Methodologically, it introduces a novel joint optimization mechanism spanning both model-level techniques—dynamic sparse inference, adaptive precision scaling, and structured pruning—and system-level innovations—RISC-V-based custom coprocessor design, on-chip memory-aware scheduling, and real-time energy-efficiency modeling. At the hardware level, a scalable 28 nm accelerator is developed to enable always-on edge computing. Evaluated on two commercial edge platforms, the framework achieves up to 11.92× higher inference throughput and 7.36× lower energy consumption, while preserving text generation quality with no statistically significant degradation. The core contribution is an end-to-end LLM deployment paradigm for edge devices that simultaneously ensures low latency, ultra-low power consumption, and strong generalization capability.
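Of the model-level techniques named above, structured pruning is the most self-contained to illustrate. The paper does not specify CLONE's pruning criterion, so the sketch below uses a common stand-in: magnitude-based channel pruning, where whole output channels (rows of a weight matrix) with the smallest L2 norm are removed, shrinking the layer in a hardware-friendly way. Function name and the `keep_ratio` parameter are illustrative, not from the paper.

```python
import numpy as np

def structured_prune(W, keep_ratio=0.5):
    """Illustrative magnitude-based structured pruning (not CLONE's
    actual criterion): drop the output channels (rows of W) with the
    smallest L2 norm, keeping a keep_ratio fraction of them."""
    n_keep = max(1, int(W.shape[0] * keep_ratio))
    norms = np.linalg.norm(W, axis=1)            # one score per output channel
    keep = np.sort(np.argsort(norms)[-n_keep:])  # indices of the strongest rows
    return W[keep], keep

# Toy example: prune a small 8x16 layer down to a quarter of its channels.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
W_pruned, kept = structured_prune(W, keep_ratio=0.25)
print(W_pruned.shape)  # (2, 16)
```

Because entire rows are removed (rather than scattered individual weights), the pruned matrix stays dense and maps directly onto a smaller matrix-multiply on an accelerator, which is why structured sparsity pairs naturally with custom hardware.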
📝 Abstract
Deploying large language models (LLMs) on edge devices is crucial for delivering fast responses and ensuring data privacy. However, the limited memory, power, and compute of edge devices make it difficult to deploy LLM-powered applications: these devices must balance latency requirements against energy consumption and model accuracy. In this paper, we first quantify the challenges of deploying LLMs on off-the-shelf edge devices and then present CLONE, an in-depth algorithm-hardware co-design at both the model and system level that intelligently integrates real-time energy optimization while maintaining robust generality. To maximize the synergistic benefits of these algorithms in always-on and intermittent edge computing settings, we design a specialized 28 nm scalable hardware accelerator system. We implement and extensively evaluate CLONE on two off-the-shelf edge platforms. Experiments show that CLONE accelerates inference by up to 11.92×, and reduces energy consumption by up to 7.36×, while maintaining high generation quality.