FED-FSTQ: Fisher-Guided Token Quantization for Communication-Efficient Federated Fine-Tuning of LLMs on Edge Devices

📅 2026-04-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the uplink communication bottleneck in federated fine-tuning of large language models on edge devices. Under heterogeneous bandwidth and intermittent participation, stragglers limit training wall-clock time, and under non-IID data, uniform compression can discard rare but task-critical signals. The authors propose Fed-FSTQ, a framework that applies Fisher-guided, non-uniform token quantization to federated fine-tuning. Fed-FSTQ uses a lightweight Fisher proxy to estimate token sensitivity, then couples importance-aware token selection with mixed-precision quantization so that informative tokens keep higher fidelity and redundant transmission is suppressed. It requires no change to server-side aggregation and works as a drop-in module for parameter-efficient fine-tuning methods such as LoRA. In experiments on multilingual and medical question answering under non-IID partitions, Fed-FSTQ reduces the cumulative uplink traffic needed to reach a fixed quality threshold by 46× relative to a standard LoRA baseline and improves end-to-end wall-clock time-to-accuracy by 52%. Enabling Fisher-guided token reduction at inference additionally yields up to a 1.55× end-to-end speedup on NVIDIA Jetson-class edge devices.
📝 Abstract
Federated fine-tuning provides a practical route to adapt large language models (LLMs) on edge devices without centralizing private data, yet in mobile deployments the training wall-clock is often bottlenecked by straggler-limited uplink communication under heterogeneous bandwidth and intermittent participation. Although parameter-efficient fine-tuning (PEFT) reduces trainable parameters, per-round payloads remain prohibitive in non-IID regimes, where uniform compression can discard rare but task-critical signals. We propose Fed-FSTQ, a Fisher-guided token quantization system primitive for communication-efficient federated LLM fine-tuning. Fed-FSTQ employs a lightweight Fisher proxy to estimate token sensitivity, coupling importance-aware token selection with non-uniform mixed-precision quantization to allocate higher fidelity to informative evidence while suppressing redundant transmission. The method is model-agnostic, serves as a drop-in module for standard federated PEFT pipelines, e.g., LoRA, without modifying the server aggregation rule, and supports bandwidth-heterogeneous clients via compact sparse message packing. Experiments on multilingual QA and medical QA under non-IID partitions show that Fed-FSTQ reduces cumulative uplink traffic required to reach a fixed quality threshold by 46x relative to a standard LoRA baseline, and improves end-to-end wall-clock time-to-accuracy by 52%. Furthermore, enabling Fisher-guided token reduction at inference yields up to a 1.55x end-to-end speedup on NVIDIA Jetson-class edge devices, demonstrating deployability under tight resource constraints.
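The abstract names three steps (a Fisher proxy for token sensitivity, importance-aware token selection, and mixed-precision quantization) without giving their exact form. The sketch below is one possible reading, not the authors' implementation. It assumes the Fisher proxy is the squared gradient norm per token, that selection keeps a top fraction of tokens by that score, and that the kept tokens are split into an 8-bit tier and a 4-bit tier. All function names, the `keep_ratio` and `high_frac` parameters, and the bit widths are hypothetical choices made for illustration.

```python
from typing import List, Tuple

Vector = List[float]
# One packed entry: (token index, bit width, scale, integer codes)
Packed = Tuple[int, int, float, List[int]]


def fisher_proxy(token_grads: List[Vector]) -> List[float]:
    """Assumed proxy: sum of squared gradient entries per token."""
    return [sum(g * g for g in grad) for grad in token_grads]


def quantize_uniform(vec: Vector, bits: int) -> Tuple[float, List[int]]:
    """Symmetric uniform quantization to signed integer codes."""
    levels = 2 ** (bits - 1) - 1
    max_abs = max((abs(v) for v in vec), default=0.0)
    if max_abs == 0.0:
        return 0.0, [0] * len(vec)
    scale = max_abs / levels
    return scale, [round(v / scale) for v in vec]


def dequantize(scale: float, codes: List[int]) -> Vector:
    """Server-side reconstruction of a quantized vector."""
    return [scale * c for c in codes]


def compress(
    token_values: List[Vector],
    token_grads: List[Vector],
    keep_ratio: float = 0.5,   # hypothetical: fraction of tokens transmitted
    high_frac: float = 0.5,    # hypothetical: share of kept tokens at high precision
    high_bits: int = 8,
    low_bits: int = 4,
) -> List[Packed]:
    """Select the most sensitive tokens and quantize them at mixed precision."""
    scores = fisher_proxy(token_grads)
    # Rank token indices from most to least sensitive.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_keep = max(1, round(keep_ratio * len(order)))
    n_high = max(1, round(high_frac * n_keep))
    message: List[Packed] = []
    for rank, idx in enumerate(order[:n_keep]):
        bits = high_bits if rank < n_high else low_bits
        scale, codes = quantize_uniform(token_values[idx], bits)
        # Sparse packing: only kept tokens appear, tagged with their index.
        message.append((idx, bits, scale, codes))
    return message
```

Under these assumptions, tokens outside the kept set are never transmitted, and each transmitted entry carries its own index and bit width, so the receiver can reconstruct a sparse update with `dequantize` without any change to the aggregation rule. A real system would also bit-pack the integer codes; this sketch leaves them as Python integers for readability.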
Problem

Research questions and friction points this paper is trying to address.

Federated Fine-Tuning
Communication Efficiency
Large Language Models
Edge Devices
Non-IID Data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fisher-guided quantization
token selection
communication-efficient federated learning
mixed-precision quantization
parameter-efficient fine-tuning
Changyu Li
School of Computing and Information Technology, Great Bay University, Dongguan, China
Shuanghong Huang
Beijing Institute of Technology, Beijing, China
Jiashen Liu
University of Warwick, Coventry, U.K.
Ming Lei
School of Computing and Information Technology, Great Bay University, Dongguan, China
Jidu Xing
City University of Hong Kong (Dongguan), Dongguan 523808, China
Kaishun Wu
IEEE Fellow; Professor of Data Science and Analytics/Internet of Things, HKUST(Guangzhou)
Internet of Things, Mobile Computing, Wireless Sensing
Lu Wang
College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China
Fei Luo
School of Computing and Information Technology, Great Bay University, Dongguan, China