🤖 AI Summary
In federated learning for large language models (FedLLM), harmful data hosted locally on clients can compromise the safety of the globally aggregated model, undermining reliability in distributed deployments. To address this, we propose the first integration of safety filters and constitutional AI into the FedLLM framework, enabling safety-aware alignment and aggregation during training: clients pre-filter harmful inputs before local inference and apply constitutional preference alignment before parameter aggregation, while the server dynamically calibrates per-client safety weights using adversarial safety evaluation. On the AdvBench benchmark, our method improves safety performance by over 20% relative to baselines, substantially suppressing unsafe outputs while preserving general model capabilities. This work establishes the first end-to-end, verifiable safety training paradigm for FedLLM.
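The safety-weighted aggregation step described above can be sketched as a variant of FedAvg in which each client's contribution is scaled by a safety score (for example, its pass rate on an adversarial evaluation such as AdvBench). This is a minimal illustrative sketch, not the paper's implementation; the function and parameter names are hypothetical.

```python
def safety_weighted_aggregate(client_params, safety_scores):
    """Aggregate client parameter vectors, weighting each client by its
    safety score (hypothetical sketch of safety-aware FedAvg).

    client_params: list of dicts mapping parameter name -> list of floats,
                   one dict per client (all clients share the same shapes)
    safety_scores: list of floats in [0, 1], one per client (e.g., the
                   fraction of adversarial prompts answered safely)
    """
    total = sum(safety_scores)
    if total == 0:
        raise ValueError("all clients scored unsafe; nothing to aggregate")
    # Normalize scores into aggregation weights that sum to 1.
    weights = [s / total for s in safety_scores]
    aggregated = {}
    for name in client_params[0]:
        aggregated[name] = [
            sum(w * params[name][i] for w, params in zip(weights, client_params))
            for i in range(len(client_params[0][name]))
        ]
    return aggregated
```

Under this weighting, a client whose local model fails the adversarial evaluation contributes proportionally less to the global model, which is one simple way the server could "calibrate safety weights" during aggregation.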
📝 Abstract
Recent research has increasingly focused on training large language models (LLMs) with federated learning, a setting known as FedLLM. However, responsible AI (RAI), which aims to ensure safe responses, remains underexplored in this context. In FedLLM, client data used for training may contain harmful content, yielding unsafe LLMs that generate harmful responses. Aggregating such unsafe LLMs into the global model and distributing it to clients may result in the widespread deployment of unsafe LLMs. To address this issue, we incorporate two well-known RAI methods into FedLLM: the safety filter and constitutional AI. Our experiments demonstrate that these methods significantly enhance LLM safety, achieving an improvement of over 20% on AdvBench, a benchmark for evaluating safety performance.