Fed-FSTQ: Fisher-Guided Token Quantization for Communication-Efficient Federated Fine-Tuning of LLMs on Edge Devices

📅 2026-04-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work targets the uplink communication bottleneck in federated fine-tuning of large language models on edge devices. Under heterogeneous bandwidth and non-IID data, uniform compression can discard rare but task-critical signals. The paper proposes Fed-FSTQ, a framework that applies Fisher-guided, non-uniform token quantization to federated fine-tuning. Fed-FSTQ uses a lightweight Fisher proxy to estimate token sensitivity, then combines importance-aware token selection with mixed-precision quantization to cut redundant transmission while retaining essential information. It requires no change to server-side aggregation and works as a drop-in module for parameter-efficient fine-tuning methods such as LoRA. On multilingual and medical question-answering tasks under non-IID partitions, Fed-FSTQ reduces the cumulative uplink traffic needed to reach a fixed quality threshold by 46× relative to a standard LoRA baseline and improves end-to-end wall-clock time-to-accuracy by 52%. Applying Fisher-guided token reduction at inference additionally yields up to a 1.55× end-to-end speedup on NVIDIA Jetson-class edge devices.

📝 Abstract

Federated fine-tuning provides a practical route to adapt large language models (LLMs) on edge devices without centralizing private data, yet in mobile deployments the training wall-clock is often bottlenecked by straggler-limited uplink communication under heterogeneous bandwidth and intermittent participation. Although parameter-efficient fine-tuning (PEFT) reduces trainable parameters, per-round payloads remain prohibitive in non-IID regimes, where uniform compression can discard rare but task-critical signals. We propose Fed-FSTQ, a Fisher-guided token quantization system primitive for communication-efficient federated LLM fine-tuning. Fed-FSTQ employs a lightweight Fisher proxy to estimate token sensitivity, coupling importance-aware token selection with non-uniform mixed-precision quantization to allocate higher fidelity to informative evidence while suppressing redundant transmission. The method is model-agnostic, serves as a drop-in module for standard federated PEFT pipelines, e.g., LoRA, without modifying the server aggregation rule, and supports bandwidth-heterogeneous clients via compact sparse message packing. Experiments on multilingual QA and medical QA under non-IID partitions show that Fed-FSTQ reduces cumulative uplink traffic required to reach a fixed quality threshold by 46x relative to a standard LoRA baseline, and improves end-to-end wall-clock time-to-accuracy by 52%. Furthermore, enabling Fisher-guided token reduction at inference yields up to a 1.55x end-to-end speedup on NVIDIA Jetson-class edge devices, demonstrating deployability under tight resource constraints.
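The abstract does not spell out how the Fisher proxy, token selection, and mixed-precision quantization fit together, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes the proxy is the squared gradient norm per token (a common diagonal empirical-Fisher approximation), that the client uploads one vector per retained token, and that two bit widths (8 and 4) are used. All function names, the `keep_ratio` and `high_frac` parameters, and the 56-bit per-token overhead are hypothetical.

```python
import math

def fisher_proxy(token_grads):
    """Assumed proxy: squared L2 norm of each token's gradient vector."""
    return [sum(g * g for g in row) for row in token_grads]

def allocate_bits(scores, keep_ratio=0.5, high_frac=0.5, high_bits=8, low_bits=4):
    """Keep the top-scoring tokens; give the most sensitive ones more bits.

    Returns {token_index: bit_width}. Dropped tokens are absent.
    """
    n_keep = max(1, math.ceil(keep_ratio * len(scores)))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = order[:n_keep]
    n_high = max(1, math.ceil(high_frac * n_keep))
    return {i: (high_bits if rank < n_high else low_bits)
            for rank, i in enumerate(kept)}

def quantize(values, bits):
    """Symmetric uniform quantization to signed integer codes plus one scale."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak > 0 else 1.0
    return [round(v / scale) for v in values], scale

def dequantize(codes, scale):
    """Server-side reconstruction of a quantized token vector."""
    return [c * scale for c in codes]

def compress(token_vectors, token_grads, **kwargs):
    """Build a sparse message: one (index, bits, scale, codes) entry per kept token."""
    bits = allocate_bits(fisher_proxy(token_grads), **kwargs)
    message = []
    for i in sorted(bits):
        codes, scale = quantize(token_vectors[i], bits[i])
        message.append((i, bits[i], scale, codes))
    return message

def payload_bits(message, overhead_bits=56):
    """Payload size, assuming 16-bit index + 8-bit width tag + 32-bit scale per token."""
    return sum(len(codes) * b + overhead_bits for _, b, _, codes in message)

if __name__ == "__main__":
    vectors = [[0.9, -0.2], [0.1, 0.0], [1.5, -1.1], [0.3, 0.4]]
    grads = [[1.0, 0.0], [0.0, 0.1], [2.0, 1.0], [0.5, 0.5]]
    msg = compress(vectors, grads)
    dense = len(vectors) * len(vectors[0]) * 32  # uncompressed float32 baseline
    print("kept tokens:", [entry[0] for entry in msg])
    print("payload bits:", payload_bits(msg), "vs dense:", dense)
```

Because each entry carries its own index and scale, the server can dequantize and scatter the retained tokens without any change to its aggregation rule, which is consistent with the drop-in claim above. The per-token overhead also shows why the savings depend on vector length: with very short vectors the index and scale fields dominate the payload.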

Problem

Research questions and friction points this paper is trying to address.

Federated Fine-Tuning

Communication Efficiency

Large Language Models

Edge Devices

Non-IID Data

Innovation

Methods, ideas, or system contributions that make the work stand out.

Fisher-Guided Quantization

Token Selection

Communication-Efficient Federated Learning