FlowerTune: A Cross-Domain Benchmark for Federated Fine-Tuning of Large Language Models

📅 2025-06-03
🤖 AI Summary
This study addresses the lack of cross-domain adaptability evaluation for large language models (LLMs) in federated learning (FL). We introduce the first FL fine-tuning benchmark covering four domains: general NLP, finance, healthcare, and programming. Our methodology comprises a comprehensive LLM-oriented federated instruction-tuning evaluation framework, featuring standardized multi-domain data partitioning, a privacy-preserving distributed fine-tuning protocol (built on Flower), domain-specific evaluation metrics, and an open-source leaderboard. Leveraging aggregation strategies including FedAvg and FedOpt, we systematically evaluate 26 mainstream pre-trained LLMs, uncovering coupling patterns among model scale, aggregation mechanism, and domain characteristics. This work bridges a critical gap in the co-optimization of FL and LLMs, providing both a methodological foundation and an empirical benchmark for privacy-aware, industry-specific LLM deployment.
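The summary above mentions aggregation strategies such as FedAvg. As a minimal sketch of what FedAvg-style aggregation does (not code from the paper or the Flower framework; function and variable names are illustrative), each client's locally fine-tuned parameters are averaged, weighted by the client's number of training examples:

```python
# Hedged sketch of FedAvg-style weighted aggregation.
# Assumption: each client reports (flattened_weights, num_examples);
# the server averages parameters weighted by data volume.

def fedavg_aggregate(client_updates):
    """client_updates: list of (weights, num_examples) pairs,
    where weights is a list of floats (flattened parameters)."""
    total_examples = sum(n for _, n in client_updates)
    num_params = len(client_updates[0][0])
    aggregated = [0.0] * num_params
    for weights, n in client_updates:
        for i, w in enumerate(weights):
            # Each client contributes proportionally to its data size.
            aggregated[i] += w * (n / total_examples)
    return aggregated

# Example: two clients; the second holds 3x more data and so
# dominates the average.
updates = [([1.0, 2.0], 100), ([3.0, 4.0], 300)]
print(fedavg_aggregate(updates))  # [2.5, 3.5]
```

FedOpt-style strategies differ in that the server treats the aggregated update as a pseudo-gradient and applies an adaptive optimizer step, rather than directly averaging weights.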

📝 Abstract
Large Language Models (LLMs) have achieved state-of-the-art results across diverse domains, yet their development remains reliant on vast amounts of publicly available data, raising concerns about data scarcity and the lack of access to domain-specific, sensitive information. Federated Learning (FL) presents a compelling framework to address these challenges by enabling decentralized fine-tuning of pre-trained LLMs without sharing raw data. However, the compatibility and performance of pre-trained LLMs in FL settings remain largely underexplored. We introduce the FlowerTune LLM Leaderboard, a first-of-its-kind benchmarking suite designed to evaluate federated fine-tuning of LLMs across four diverse domains: general NLP, finance, medical, and coding. Each domain includes federated instruction-tuning datasets and domain-specific evaluation metrics. Our results, obtained through a collaborative, open-source and community-driven approach, provide the first comprehensive comparison across 26 pre-trained LLMs with different aggregation and fine-tuning strategies under federated settings, offering actionable insights into model performance, resource constraints, and domain adaptation. This work lays the foundation for developing privacy-preserving, domain-specialized LLMs for real-world applications.
Problem

Research questions and friction points this paper is trying to address.

Evaluating federated fine-tuning of LLMs across diverse domains
Assessing compatibility of pre-trained LLMs in federated learning settings
Addressing data scarcity and privacy in domain-specific LLM development
Innovation

Methods, ideas, or system contributions that make the work stand out.

Federated Learning for decentralized LLM fine-tuning
Cross-domain benchmark with diverse datasets
Comprehensive comparison of 26 LLMs
Yan Gao
Flower Labs, University of Cambridge
Massimo Roberto Scamarcia
Entrust Corp
Javier Fernandez-Marques
Flower Labs, University of Cambridge
Mohammad Naseri
Flower Labs
Privacy · Security · Trustworthy Machine Learning · Federated Learning
Chong Shen Ng
University of Twente
direct numerical simulation · multiphase flows · turbulent convection
Dimitris Stripelis
Flower Labs
Federated AI · Federated Learning · Machine Learning · Database Systems · Data Integration
Zexi Li
Alibaba Group
Deep Learning · Large Language Models · Federated Learning
Tao Shen
Zhejiang University
Jiamu Bai
Penn State University
Daoyuan Chen
Alibaba Group
Efficient Machine Learning · Human-Centric ML · Large Language Models · Multimodality
Zikai Zhang
University of Nevada, Reno
Rui Hu
University of Nevada, Reno
InSeo Song
Gachon University
Lee KangYoon
Gachon University
Hong Jia
Lecturer (Assistant Professor), University of Auckland; University of Melbourne
On-Device ML · Human-Centred AI · Mobile Computing · Mobile Health
Ting Dang
Senior Lecturer in AI for Health, The University of Melbourne
Mobile Health · Audio Processing · Affective Computing · Time Series Modelling · Wearable Sensing
Junyan Wang
Postdoctoral Research Fellow, University of Adelaide
Deep Learning · Computer Vision · Generative AI
Zheyuan Liu
The University of Adelaide
Daniel J. Beutel
Flower Labs
Lingjuan Lyu
Sony
Foundation Models · Federated Learning · Responsible AI
Nicholas D. Lane
Flower Labs, University of Cambridge