🤖 AI Summary
When only a subset of features is privacy-sensitive, conventional differential privacy (DP) methods degrade utility by injecting noise globally. To address this, we propose FusionDP, a framework that leverages foundation models and a two-stage training strategy: first, foundation-model-assisted imputation of the sensitive features mitigates information loss; second, a modified DP-SGD algorithm provides rigorous privacy guarantees at the sensitive-feature level. Empirical evaluation on sepsis prediction and clinical text classification shows that FusionDP consistently outperforms privacy-preserving baselines, including DP-SGD and PATE, in both AUC and accuracy. These results demonstrate that fine-grained, feature-level privacy control can substantially improve the privacy-utility trade-off.
📝 Abstract
Ensuring the privacy of sensitive training data is crucial in privacy-preserving machine learning. However, in practical scenarios, privacy protection may be required for only a subset of features. For instance, in ICU data, demographic attributes like age and gender pose higher privacy risks due to their re-identification potential, whereas raw lab results are generally less sensitive. Traditional DP-SGD enforces privacy protection on all features of each sample, leading to excessive noise injection and significant utility degradation. We propose FusionDP, a two-step framework that enhances model utility under feature-level differential privacy. First, FusionDP leverages large foundation models to impute sensitive features given non-sensitive features, using the imputed values as external priors that provide high-quality estimates of sensitive attributes without accessing the true values during model training. Second, we introduce a modified DP-SGD algorithm that trains models on both original and imputed features while formally preserving the privacy of the original sensitive features. We evaluate FusionDP on two modalities: a sepsis prediction task on tabular data from PhysioNet and a clinical note classification task from MIMIC-III. Comparisons against privacy-preserving baselines show that FusionDP significantly improves model performance while maintaining rigorous feature-level privacy, demonstrating the potential of foundation-model-driven imputation to enhance the privacy-utility trade-off across modalities.
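For context on the second step, the sketch below shows a standard DP-SGD update (per-sample gradient clipping followed by calibrated Gaussian noise), which is the mechanism FusionDP modifies. This is a minimal illustration in NumPy, not the paper's implementation; the function name and parameters are illustrative, and the feature-level restriction FusionDP introduces is not shown.

```python
import numpy as np

def dp_sgd_step(per_sample_grads, params, clip_norm, noise_multiplier, lr, rng):
    """One standard DP-SGD update (illustrative, not FusionDP's variant).

    per_sample_grads: list of gradient vectors, one per example in the batch.
    clip_norm: L2 bound C applied to each per-sample gradient.
    noise_multiplier: Gaussian noise stddev is noise_multiplier * C.
    """
    # 1. Clip each per-sample gradient to L2 norm <= clip_norm,
    #    bounding the contribution (sensitivity) of any single example.
    clipped = []
    for g in per_sample_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))
    grad_sum = np.sum(clipped, axis=0)

    # 2. Add Gaussian noise calibrated to the clipping bound.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=grad_sum.shape)

    # 3. Average and take a gradient step.
    noisy_mean = (grad_sum + noise) / len(per_sample_grads)
    return params - lr * noisy_mean
```

Because each example's influence on the summed gradient is capped at `clip_norm`, the added noise yields a formal DP guarantee for the whole sample; FusionDP's contribution is to confine this protection to the sensitive features only.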