AI Summary
This work addresses the absence of methods in federated fine-tuning of large language models that simultaneously preserve client privacy and enable client-level data provenance via watermarking. The authors propose FedAttr, the first protocol to jointly achieve privacy protection and fine-grained watermark attribution within a secure aggregation framework. FedAttr generates unbiased update estimates using a paired subset differencing mechanism and integrates secure aggregation query differencing, watermark differencing scores, and cross-round fusion via Stouffer's method to accurately identify clients whose data contributed to watermark insertion, all while theoretically bounding mutual information leakage. Experiments demonstrate that FedAttr attains 100% true positive rate and 0% false positive rate, improving true positive rate by at least 44.4% or reducing false positive rate by 19.1% over baselines, with only a 6.3% increase in training overhead.
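The cross-round fusion step relies on Stouffer's method, a standard way to combine independent z-scores. The sketch below is a minimal illustration of that textbook formula, not the authors' implementation: the function name `stouffer_combine`, the equal weighting of rounds, and the one-sided p-value are all assumptions made here for illustration.

```python
from math import sqrt
from statistics import NormalDist


def stouffer_combine(z_scores):
    """Combine per-round z-scores with unweighted Stouffer's method.

    Assumes the per-round scores are independent and standard normal
    under the null hypothesis (the client did not train on watermarked data).
    Returns the combined z-score and a one-sided p-value.
    """
    if not z_scores:
        raise ValueError("need at least one z-score")
    # Sum of k independent N(0, 1) variables has variance k,
    # so dividing by sqrt(k) restores unit variance.
    combined_z = sum(z_scores) / sqrt(len(z_scores))
    # One-sided test: large positive scores indicate watermark evidence.
    p_value = 1.0 - NormalDist().cdf(combined_z)
    return combined_z, p_value


if __name__ == "__main__":
    z, p = stouffer_combine([1.0, 1.0, 1.0, 1.0])
    print(f"combined z = {z:.3f}, one-sided p = {p:.4f}")
```

Four rounds that each yield a weak score of z = 1.0 combine to z = 2.0, which shows why pooling evidence across rounds can surface a signal that no single round would.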
Abstract
Watermark radioactivity tests can detect whether a model was trained on watermarked documents and have become key tools for protecting data ownership in the fine-tuning of large language models (LLMs). Existing work has shown that they are effective in centralized LLM fine-tuning. However, these methods face several challenges and remain underexplored in federated learning (FL), a widely applied paradigm for fine-tuning LLMs collaboratively on private data across different users. FL mainly ensures privacy through secure aggregation (SA), which allows the server to aggregate updates while keeping each client's update private. This mechanism preserves privacy but makes it difficult to identify which client trained on watermarked documents. In this work, we propose FedAttr, a new client-level attribution protocol for FL. FedAttr identifies which clients trained on watermarked data via a paired-subset-difference mechanism, while preserving both the privacy guarantees of SA and FL performance. FedAttr proceeds in three steps: (i) estimate each client's update by differencing two SA queries, (ii) score the estimate with the watermark detector via differential scoring, and (iii) combine scores across rounds via Stouffer's method. We theoretically show that FedAttr produces an unbiased estimator of each client's update with bounded mutual information leakage (i.e., $O(d^*/N)$ for each per-round update). Moreover, FedAttr empirically achieves 100% TPR and 0% FPR, outperforming all baselines by at least 44.4% in TPR or 19.1% in FPR, with only 6.3% overhead relative to FL training time. Ablation studies confirm that FedAttr is robust to protocol parameters and configurations.
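Step (i) can be illustrated with a toy example of differencing two aggregation queries. The abstract does not specify how the paired subsets are constructed, so everything below is an assumption: `secure_sum` is a plain-Python stand-in for an SA query that reveals only a subset sum, and `estimate_client_update` pairs a random subset with the same subset plus the target client. In this idealized form the difference recovers the target's update exactly. The sketch does not model the leakage bound, nor any projection or noise the actual protocol may apply.

```python
import random


def secure_sum(updates, subset):
    """Hypothetical stand-in for a secure-aggregation query.

    Returns only the coordinate-wise sum of the updates in `subset`;
    a real SA protocol would compute this without exposing individual updates.
    """
    dim = len(next(iter(updates.values())))
    total = [0.0] * dim
    for client_id in subset:
        for j, value in enumerate(updates[client_id]):
            total[j] += value
    return total


def estimate_client_update(updates, target, subset_size, rng):
    """Estimate `target`'s update by differencing two paired SA queries."""
    others = [c for c in updates if c != target]
    base = rng.sample(others, subset_size)  # random subset without the target
    with_target = secure_sum(updates, base + [target])
    without_target = secure_sum(updates, base)
    # The shared subset cancels, leaving the target's contribution.
    return [a - b for a, b in zip(with_target, without_target)]


if __name__ == "__main__":
    updates = {
        "a": [1.0, 2.0],
        "b": [0.5, -1.0],
        "c": [3.0, 0.0],
        "d": [-2.0, 4.0],
    }
    estimate = estimate_client_update(updates, "b", 2, random.Random(0))
    print(estimate)
```

The resulting estimate would then be passed to the watermark detector in step (ii) and accumulated across rounds in step (iii).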