🤖 AI Summary
Federated learning (FL) faces practical deployment bottlenecks in medical image analysis, including high manual coordination overhead, cross-institutional data and label heterogeneity, and reliance on empirical trial-and-error for algorithm selection. To address these challenges, we propose FedAgentBench, the first multi-agent framework designed for the end-to-end FL pipeline in healthcare. It orchestrates server-side and client-side LLM agents to autonomously perform client selection, data preprocessing, label alignment, and FL algorithm scheduling. The framework supports 40 FL algorithms and task orchestration across six medical imaging modalities, accompanied by a standardized evaluation benchmark. Experiments across 24 open- and closed-source LLMs show that GPT-4.1 and DeepSeek-V3 exhibit strong agentic capabilities, substantially reducing human intervention. However, performance remains suboptimal on complex reasoning tasks with implicit objectives, indicating room for improvement.
📝 Abstract
Federated learning (FL) allows collaborative model training across healthcare sites without sharing sensitive patient data. However, real-world FL deployment is often hindered by complex operational challenges that demand substantial human effort. These include: (a) selecting appropriate clients (hospitals), (b) coordinating between the central server and clients, (c) client-level data pre-processing, (d) harmonizing non-standardized data and labels across clients, and (e) selecting FL algorithms based on user instructions and cross-client data characteristics. Existing FL works largely overlook these practical orchestration challenges. These operational bottlenecks motivate the need for autonomous, agent-driven FL systems, in which intelligent agents at each hospital client and at the central server collaboratively manage FL setup and model training with minimal human intervention. To this end, we first introduce an agent-driven FL framework that captures the key phases of real-world FL workflows, from client selection to training completion, together with a benchmark dubbed FedAgentBench that evaluates the ability of LLM agents to autonomously coordinate healthcare FL. Our framework incorporates 40 FL algorithms, each tailored to diverse task-specific requirements and cross-client characteristics. Furthermore, we introduce a diverse set of complex tasks across 201 carefully curated datasets, simulating 6 modality-specific real-world healthcare environments, viz., Dermatoscopy, Ultrasound, Fundus, Histopathology, MRI, and X-Ray. We assess the agentic performance of 14 open-source and 10 proprietary LLMs spanning small, medium, and large model scales. While some agent cores such as GPT-4.1 and DeepSeek-V3 can automate various stages of the FL pipeline, our results reveal that more complex, interdependent tasks based on implicit goals remain challenging for even the strongest models.
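To make the orchestration phases concrete, the sketch below mocks one agent-driven FL round: a server-side agent selects eligible clients, client-side agents harmonize site-specific labels onto a shared vocabulary, an algorithm is chosen from cross-client characteristics, and local updates are aggregated FedAvg-style. All names (`ServerAgent`, `ClientAgent`, the selection and algorithm heuristics) are illustrative assumptions, not the FedAgentBench implementation, where these decisions would be made by LLM agents rather than fixed rules.

```python
# Hypothetical sketch of one agent-driven FL round; the class names and
# heuristics are illustrative stand-ins, not the paper's actual framework.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ClientAgent:
    """Hospital-side agent: holds local data stats and raw label names."""
    name: str
    num_samples: int
    local_labels: List[str]
    weights: List[float] = field(default_factory=lambda: [0.0, 0.0])

    def harmonize_labels(self, canonical: Dict[str, str]) -> List[str]:
        # Map site-specific label spellings onto a shared vocabulary.
        return [canonical.get(lbl, lbl) for lbl in self.local_labels]

    def local_update(self) -> List[float]:
        # Stand-in for local training: nudge each weight by a fixed step.
        self.weights = [w + 1.0 for w in self.weights]
        return self.weights

class ServerAgent:
    """Server-side agent: selects clients, picks an algorithm, aggregates."""

    def select_clients(self, clients: List[ClientAgent],
                       min_samples: int = 50) -> List[ClientAgent]:
        # Simple eligibility rule standing in for LLM-driven client selection.
        return [c for c in clients if c.num_samples >= min_samples]

    def choose_algorithm(self, clients: List[ClientAgent]) -> str:
        # Heuristic stand-in for algorithm scheduling:
        # heterogeneous label sets suggest FedProx, otherwise FedAvg.
        label_sets = {tuple(sorted(c.local_labels)) for c in clients}
        return "FedProx" if len(label_sets) > 1 else "FedAvg"

    def aggregate(self, updates: List[List[float]],
                  sizes: List[int]) -> List[float]:
        # FedAvg: sample-size-weighted mean of client weight vectors.
        total = sum(sizes)
        dim = len(updates[0])
        return [sum(u[i] * n for u, n in zip(updates, sizes)) / total
                for i in range(dim)]

if __name__ == "__main__":
    clients = [
        ClientAgent("hospital_a", 100, ["melanoma", "nevus"]),
        ClientAgent("hospital_b", 80, ["MEL", "NV"]),
        ClientAgent("hospital_c", 10, ["melanoma"]),  # too few samples
    ]
    canonical = {"MEL": "melanoma", "NV": "nevus"}
    server = ServerAgent()

    chosen = server.select_clients(clients)
    for c in chosen:
        c.local_labels = c.harmonize_labels(canonical)
    algo = server.choose_algorithm(chosen)
    global_weights = server.aggregate(
        [c.local_update() for c in chosen],
        [c.num_samples for c in chosen],
    )
    print(algo, global_weights)
```

In the paper's setting, each rule-based decision above (eligibility, label mapping, algorithm choice) is instead delegated to an LLM agent reasoning over user instructions and cross-client metadata.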