VEDA: Efficient LLM Generation Through Voting-based KV Cache Eviction and Dataflow-flexible Accelerator

📅 2025-07-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Deploying large language model (LLM) inference on edge devices faces significant challenges due to high computational overhead and memory pressure. To address this, we propose VEDA, a hardware–software co-design framework. VEDA introduces a voting-based KV cache eviction algorithm enabling O(1)-complexity dynamic cache management; a reconfigurable processing element array with flexible multiplication dataflow to efficiently support variable-length sequences and multidimensional workloads; and element-wise serial scheduling to optimize nonlinear operations such as softmax and LayerNorm. Experimental results demonstrate that VEDA substantially reduces inference latency and hardware resource consumption. On edge platforms, it achieves superior energy efficiency compared to state-of-the-art approaches, thereby enhancing real-time responsiveness and strengthening on-device privacy preservation for localized LLM inference.
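The summary does not spell out how the voting-based eviction works, but the idea can be sketched as follows, under the assumption that each decoding step lets the current query's attention scores "vote" for its most-attended cached positions, and the least-voted KV vector is evicted once the cache is full. The function name, the `top_m`/`capacity` parameters, and the linear-scan eviction are illustrative choices, not the paper's mechanism; in particular, the paper's O(1)-complexity bookkeeping is not reproduced here.

```python
import numpy as np

def voting_evict(cache_keys, cache_votes, attn_scores, top_m, capacity):
    """One decoding step of a hypothetical voting-style eviction policy.

    Each step, the current query casts votes for the top_m cached
    positions it attends to most; if the cache exceeds `capacity`,
    the position with the fewest accumulated votes is dropped.
    (Reference version: the victim search below scans all votes,
    whereas the paper claims O(1) dynamic cache management.)
    """
    voted = np.argsort(attn_scores)[-top_m:]   # indices receiving a vote
    cache_votes[voted] += 1
    if len(cache_keys) > capacity:
        victim = int(np.argmin(cache_votes))   # least-voted KV vector
        cache_keys = np.delete(cache_keys, victim, axis=0)
        cache_votes = np.delete(cache_votes, victim)
    return cache_keys, cache_votes
```

In this reading, vote counts accumulate importance evidence across steps, so a vector that was attended to early but never again gradually becomes the eviction candidate.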

📝 Abstract
Large Language Models (LLMs) excel in natural language processing tasks but pose significant computational and memory challenges for edge deployment due to their intensive resource demands. This work addresses the efficiency of LLM inference through algorithm–hardware–dataflow tri-optimization. We propose a novel voting-based KV cache eviction algorithm that balances hardware efficiency and algorithm accuracy by adaptively identifying unimportant KV vectors. From a dataflow perspective, we introduce a flexible-product dataflow and a runtime-reconfigurable PE array for matrix-vector multiplication. The proposed approach effectively handles diverse dimensional requirements and solves the challenges posed by incrementally varying sequence lengths. Additionally, an element-serial scheduling scheme is proposed for nonlinear operations such as softmax and layer normalization (LayerNorm). Results demonstrate a substantial reduction in latency, accompanied by a significant decrease in hardware complexity from O(N) to O(1). The proposed solution is realized in a custom-designed accelerator, VEDA, which outperforms existing hardware platforms. This research represents a significant advancement in LLM inference on resource-constrained edge devices, facilitating real-time processing, enhancing data privacy, and enabling model customization.
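The flexible-product dataflow itself is a hardware mapping, but the problem it targets can be illustrated in software: a matrix-vector product computed tile by tile, where the tile shape is chosen per layer to match the matrix dimensions, so oddly-shaped projections and a KV length that grows every decoding step do not leave processing elements idle. The function and its `rows_per_pass`/`cols_per_pass` parameters are assumptions for illustration, not the paper's PE-array configuration.

```python
import numpy as np

def matvec_tiled(W, x, rows_per_pass, cols_per_pass):
    """Matrix-vector product computed in reconfigurable tiles.

    Software stand-in for a runtime-reconfigurable PE array: each
    (rows_per_pass x cols_per_pass) tile models one pass over the
    array, and the tile shape can be re-chosen per matrix so that
    variable dimensions still map onto the same fixed hardware.
    """
    M, N = W.shape
    y = np.zeros(M)
    for r in range(0, M, rows_per_pass):
        for c in range(0, N, cols_per_pass):
            # Slicing handles ragged edges when M or N is not a
            # multiple of the tile shape.
            tile = W[r:r + rows_per_pass, c:c + cols_per_pass]
            y[r:r + rows_per_pass] += tile @ x[c:c + cols_per_pass]
    return y
```

A fixed tile shape would waste cycles whenever a dimension is not a multiple of it; letting the shape track the workload is the software analogue of the reconfigurability the abstract describes.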
Problem

Research questions and friction points this paper is trying to address.

Efficient LLM inference for edge deployment
Balancing hardware efficiency and algorithm accuracy
Handling diverse dimensional requirements in matrix-vector multiplication
Innovation

Methods, ideas, or system contributions that make the work stand out.

Voting-based KV cache eviction algorithm
Flexible-product dataflow and reconfigurable PE array
Element-serial scheduling for nonlinear operations
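The element-serial scheduling idea for softmax can be sketched with a streaming formulation: elements arrive one at a time, and only a running maximum and a rescaled exponential sum are maintained, so the normalization statistics need O(1) state instead of buffering the whole row. This is a generic online-softmax sketch in the spirit of the abstract's O(N)-to-O(1) claim, not the paper's hardware schedule; the LayerNorm path is omitted, and the output list below is kept only so the result can be returned in one call.

```python
import math

def element_serial_softmax(stream):
    """Softmax over a stream of scores, one element at a time.

    Running statistics: m is the maximum seen so far, s is the sum of
    exponentials rescaled to that maximum. Both are scalars, so the
    reduction needs O(1) storage; a second pass (here, the buffered
    inputs) produces the normalized outputs.
    """
    m, s = float("-inf"), 0.0
    xs = []
    for x in stream:
        xs.append(x)
        new_m = max(m, x)
        # Rescale the running sum to the new maximum before adding.
        s = s * math.exp(m - new_m) + math.exp(x - new_m)
        m = new_m
    return [math.exp(x - m) / s for x in xs]
```

The rescaling step is what lets the maximum be discovered on the fly, avoiding the separate full-row max pass that a naive numerically-stable softmax would need.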
Zhican Wang
State Key Laboratory of Micro/Nano Engineering Science, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China
Hongxiang Fan
Imperial College London, United Kingdom
Haroon Waris
Institute of Space Technology, Pakistan
Gang Wang
State Key Laboratory of Micro/Nano Engineering Science, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China
Zhenyu Li
State Key Laboratory of Micro/Nano Engineering Science, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China
Jianfei Jiang
Shanghai Jiao Tong University, Shanghai, China
Yanan Sun
State Key Laboratory of Micro/Nano Engineering Science, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China
Guanghui He
State Key Laboratory of Micro/Nano Engineering Science, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China