🤖 AI Summary
In federated learning for large language models (FedLLM), harmful data hosted locally on clients can compromise the safety of the globally aggregated model, undermining reliability in distributed deployments. To address this, we propose the first integration of safety filters and constitutional AI into the FedLLM framework, enabling safety-aware alignment and aggregation during training: clients pre-filter harmful inputs before local inference and apply constitutional preference alignment before parameter aggregation, while the server dynamically calibrates per-client safety weights using adversarial safety evaluation. On the AdvBench benchmark, our method improves safety performance by over 20% relative to baselines, substantially suppressing unsafe outputs while preserving general model capabilities. This work establishes the first end-to-end, verifiable safety training paradigm for FedLLM.
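The safety-weighted aggregation step described above can be sketched as a variant of FedAvg in which each client's contribution is scaled by a safety score (for example, its pass rate on an adversarial evaluation such as AdvBench). This is a minimal illustrative sketch, not the paper's implementation; the function and parameter names are hypothetical.

```python
def safety_weighted_aggregate(client_params, safety_scores):
    """Aggregate client parameter vectors, weighting each client by its
    safety score (hypothetical sketch of safety-aware FedAvg).

    client_params: list of dicts mapping parameter name -> list of floats,
                   one dict per client (all clients share the same shapes)
    safety_scores: list of floats in [0, 1], one per client (e.g., the
                   fraction of adversarial prompts answered safely)
    """
    total = sum(safety_scores)
    if total == 0:
        raise ValueError("all clients scored unsafe; nothing to aggregate")
    # Normalize scores into aggregation weights that sum to 1.
    weights = [s / total for s in safety_scores]
    aggregated = {}
    for name in client_params[0]:
        aggregated[name] = [
            sum(w * params[name][i] for w, params in zip(weights, client_params))
            for i in range(len(client_params[0][name]))
        ]
    return aggregated
```

Under this weighting, a client whose local model fails the adversarial evaluation contributes proportionally less to the global model, which is one simple way the server could "calibrate safety weights" during aggregation.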
📝 Abstract
Recent research has increasingly focused on training large language models (LLMs) with federated learning, a setting known as FedLLM. However, responsible AI (RAI), which aims to ensure safe responses, remains underexplored in this context. In FedLLM, client data used for training may contain harmful content, yielding unsafe LLMs that generate harmful responses. Aggregating such unsafe LLMs into the global model and distributing it to clients may result in the widespread deployment of unsafe LLMs. To address this issue, we incorporate two well-known RAI methods into FedLLM: the safety filter and constitutional AI. Our experiments demonstrate that these methods significantly enhance LLM safety, achieving an improvement of over 20% on AdvBench, a benchmark for evaluating safety performance.