EdgeInfinite-Instruct: Bridging SFT-Based Optimization and NPU-Level Efficiency for Edge Devices

📅 2025-08-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the high computational cost of self-attention, the memory pressure of a growing Key-Value (KV) cache, and the difficulty of optimizing time-to-first-token (TTFT) when deploying long-sequence large language models (LLMs) on edge devices, this paper proposes EdgeInfinite-Instruct, a lightweight hardware-cooperative optimization framework built on EdgeInfinite. Methodologically, it introduces: (1) a Segmented Supervised Fine-Tuning (S-SFT) strategy that strengthens instruction-following capability while updating only 0.1% of parameters; (2) a fixed-shape computation graph design tailored to neural processing units (NPUs) that enables fine-grained quantization; and (3) a scenario-aware KV cache management mechanism. Evaluated on long-context benchmarks and real-world mobile tasks, EdgeInfinite-Instruct achieves low-latency, high-accuracy on-device inference without full model retraining or specialized infrastructure. It significantly reduces TTFT and memory footprint while preserving model accuracy, outperforming existing approaches in both efficiency and practical deployability.
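The segmented processing idea behind S-SFT, splitting a long token sequence into fixed-length chunks that are handled one at a time, can be sketched as follows (a simplified illustration; the paper's actual segmentation, e.g. chunk boundaries or overlap, may differ):

```python
def segment_tokens(tokens, segment_len):
    """Split a long token sequence into fixed-length segments.

    Illustrative only: the function name and the plain non-overlapping
    split are assumptions, not the paper's exact S-SFT procedure.
    """
    return [tokens[i:i + segment_len] for i in range(0, len(tokens), segment_len)]

# A 10-token sequence split into segments of at most 4 tokens:
segments = segment_tokens(list(range(10)), 4)
```

Processing fixed-length segments instead of the full sequence is what lets the per-step attention cost stay bounded as the input grows.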

📝 Abstract
Deploying Transformer-based large language models (LLMs) on resource-constrained edge devices for long-sequence tasks remains challenging due to the quadratic time complexity of self-attention and growing Key-Value (KV) cache demands. While existing KV cache optimizations improve memory efficiency, they often fail to reduce time to first token (TTFT) and may degrade performance through token pruning. Alternative sequence modeling architectures address some of these limitations, but typically require full retraining and lack infrastructure support. EdgeInfinite offers an efficient solution by fine-tuning only a small subset of parameters, maintaining quality while reducing both computational and memory costs, including improved TTFT. However, its instruction-following ability is limited, and it lacks mobile-specific optimizations. To address these issues, we propose EdgeInfinite-Instruct, which introduces a Segmented Supervised Fine-Tuning (S-SFT) strategy tailored to long-sequence tasks such as summarization and question answering. We further optimized EdgeInfinite-Instruct for efficient deployment on edge NPUs by employing fine-grained post-training quantization (PTQ) to reduce computational demands while maintaining accuracy, and by implementing a fixed-shape computation graph that balances memory usage and on-device efficiency through scenario-specific customization of input token and cache sizes. Experiments on long-context benchmarks and real-world mobile tasks show that our approach improves domain-specific performance while maintaining efficiency on NPU-accelerated edge devices.
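A fixed-shape computation graph, as described in the abstract, means the compiled NPU graph always receives tensors of a predetermined size, so shorter inputs must be padded up to that size. A minimal sketch of the padding step (function name and pad id are hypothetical, not from the paper):

```python
def pad_to_fixed(tokens, fixed_len, pad_id=0):
    """Pad (or truncate) a token list to a fixed length so a compiled
    NPU graph always sees the same tensor shape.

    Returns the fixed-shape token list and the count of valid tokens,
    which a real implementation would use to mask out padding.
    """
    if len(tokens) >= fixed_len:
        return tokens[:fixed_len], fixed_len
    return tokens + [pad_id] * (fixed_len - len(tokens)), len(tokens)
```

Scenario-specific customization, as the abstract puts it, would then amount to choosing `fixed_len` (and the cache size) per use case, e.g. larger for summarization than for short QA.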
Problem

Research questions and friction points this paper is trying to address.

Optimizing LLMs for edge devices with limited resources
Reducing computational and memory costs for long-sequence tasks
Improving instruction-following ability and adding mobile-specific (NPU-level) optimizations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Segmented Supervised Fine-Tuning for long sequences
Fine-grained post-training quantization for efficiency
Fixed-shape computation graph for memory balance
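Fine-grained post-training quantization typically assigns a separate scale to each weight channel rather than one scale per tensor, which preserves accuracy for channels with small dynamic range. A generic per-channel symmetric int8 sketch (not the paper's exact PTQ scheme):

```python
def quantize_per_channel(weights):
    """Symmetric int8 quantization with one scale per output channel.

    `weights` is a list of rows (channels) of floats. Each row gets its
    own scale max|w| / 127; all-zero rows fall back to scale 1.0.
    A generic illustration, not the paper's specific quantizer.
    """
    quantized, scales = [], []
    for row in weights:
        scale = max(abs(v) for v in row) / 127 or 1.0
        scales.append(scale)
        quantized.append([round(v / scale) for v in row])
    return quantized, scales
```

Dequantization multiplies each int8 value by its channel's scale, so per-channel scales bound the rounding error by half a scale step per weight.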
👥 Authors
Jiyu Chen (vivo AI Lab, Zhejiang University)
Poh Seng Lim (MediaTek Inc.)
Shuang Peng (vivo AI Lab)
Daxiong Luo (vivo AI Lab)
JungHau Foo (MediaTek Inc.)
Yap Deep (MediaTek Inc.)
Timothy Lee Jun Jie (MediaTek Inc.)
Kelvin Teh Kae Wen (MediaTek Inc.)
Fan Yang (vivo AI Lab)
Danyu Feng (vivo AI Lab)
Hao-Yun Chen (MediaTek Inc.)
Peng-Wen Chen (MediaTek Inc.)
Fangyuan Li (vivo AI Lab)
Xiaoxin Chen (Coriell Institute for Medical Research)
Wong Wai Mun (MediaTek Inc.)