🤖 AI Summary
This study addresses the lack of cross-domain adaptability evaluation for large language models (LLMs) in federated learning (FL). We introduce the first FL fine-tuning benchmark covering four domains: general NLP, finance, healthcare, and programming. Our methodology comprises a comprehensive LLM-oriented federated instruction-tuning evaluation framework, featuring standardized multi-domain data partitioning, a privacy-preserving distributed fine-tuning protocol (built on Flower), domain-specific evaluation metrics, and an open-source leaderboard. Leveraging aggregation strategies including FedAvg and FedOpt, we systematically evaluate 26 mainstream pre-trained LLMs, uncovering coupling patterns among model scale, aggregation mechanism, and domain characteristics. This work bridges a critical gap in the co-optimization of FL and LLMs, providing both a methodological foundation and an empirical benchmark for privacy-aware, industry-specific LLM deployment.
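The FedAvg strategy mentioned above aggregates client updates by a data-size-weighted average of model parameters. The following is a minimal, hypothetical sketch of that weighted averaging step (function name `fedavg` and the toy per-layer representation are illustrative, not taken from the benchmark's codebase; in practice Flower's built-in `FedAvg` strategy handles this server-side):

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Weighted-average per-layer parameter arrays across clients.

    client_weights: list of per-client parameter lists (one array per layer).
    client_sizes:   number of local training examples per client.
    """
    total = sum(client_sizes)
    # Each client's contribution is scaled by its share of the total data.
    return [
        sum(w[layer] * (n / total) for w, n in zip(client_weights, client_sizes))
        for layer in range(len(client_weights[0]))
    ]

# Two toy clients, one "layer" each; client 2 holds 3x the data of client 1.
clients = [[np.array([1.0, 2.0])], [np.array([3.0, 4.0])]]
sizes = [1, 3]
agg = fedavg(clients, sizes)  # -> [array([2.5, 3.5])]
```

FedOpt generalizes this by treating the aggregated update as a pseudo-gradient and applying a server-side optimizer (e.g., Adam) instead of plain averaging.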
📝 Abstract
Large Language Models (LLMs) have achieved state-of-the-art results across diverse domains, yet their development remains reliant on vast amounts of publicly available data, raising concerns about data scarcity and the lack of access to domain-specific, sensitive information. Federated Learning (FL) presents a compelling framework to address these challenges by enabling decentralized fine-tuning of pre-trained LLMs without sharing raw data. However, the compatibility and performance of pre-trained LLMs in FL settings remain largely underexplored. We introduce the FlowerTune LLM Leaderboard, a first-of-its-kind benchmarking suite designed to evaluate federated fine-tuning of LLMs across four diverse domains: general NLP, finance, medical, and coding. Each domain includes federated instruction-tuning datasets and domain-specific evaluation metrics. Our results, obtained through a collaborative, open-source, community-driven approach, provide the first comprehensive comparison of 26 pre-trained LLMs with different aggregation and fine-tuning strategies under federated settings, offering actionable insights into model performance, resource constraints, and domain adaptation. This work lays the foundation for developing privacy-preserving, domain-specialized LLMs for real-world applications.