Tuning without Peeking: Provable Privacy and Generalization Bounds for LLM Post-Training

📅 2025-07-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Addressing the dual challenges of privacy and security risks (e.g., data poisoning and model extraction) induced by gradient-based optimization in deep learning, and the poor scalability of black-box methods on large language models (LLMs), this paper proposes BBoxER: the first post-training framework for LLMs that integrates the information bottleneck principle into evolutionary black-box optimization. BBoxER implicitly compresses the training data to construct an information bottleneck, enabling parameter updates without exposing raw gradients or training data, thereby providing provable privacy guarantees and non-trivial generalization error bounds. Theoretically grounded and empirically validated, BBoxER is robust against both data poisoning and model extraction attacks, and with only a modest number of function evaluations it achieves significant performance gains across diverse reasoning tasks. BBoxER thus establishes a lightweight, reliable, and gradient-free paradigm for privacy-preserving fine-tuning of LLMs in security-sensitive applications.

📝 Abstract
Gradient-based optimization is the workhorse of deep learning, offering efficient and scalable training via backpropagation. However, its reliance on large volumes of labeled data raises privacy and security concerns, such as susceptibility to data poisoning attacks and the risk of overfitting. In contrast, black-box optimization methods, which treat the model as an opaque function and rely solely on function evaluations to guide optimization, offer a promising alternative in scenarios where data access is restricted, adversarial risks are high, or overfitting is a concern. However, black-box methods also pose significant challenges, including poor scalability to high-dimensional parameter spaces, as prevalent in large language models (LLMs), and high computational costs due to reliance on numerous model evaluations. This paper introduces BBoxER, an evolutionary black-box method for LLM post-training that induces an information bottleneck via implicit compression of the training data. Leveraging the tractability of information flow, we provide strong theoretical bounds on generalization, differential privacy, susceptibility to data poisoning attacks, and robustness to extraction attacks. BBoxER operates on top of pre-trained LLMs, offering a lightweight and modular enhancement suitable for deployment in restricted or privacy-sensitive environments, in addition to non-vacuous generalization guarantees. In experiments with LLMs, we demonstrate empirically that Retrofitting methods are able to learn, showing how a few iterations of BBoxER improve performance and generalize well on a benchmark of reasoning datasets. This positions BBoxER as an attractive add-on to gradient-based optimization.
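To make the contrast with gradient-based training concrete, the following is a minimal sketch of evolutionary black-box optimization in the spirit the abstract describes: a (1+1)-style evolution strategy that touches the objective only through score evaluations, never through gradients. It is illustrative only, not the paper's BBoxER algorithm; the function names and the toy objective are assumptions.

```python
import random

def evolutionary_blackbox_search(score, dim, iterations=200, sigma=0.1, seed=0):
    """(1+1)-style evolution strategy (illustrative sketch, not BBoxER itself).

    The optimizer queries `score` as an opaque function, so it never needs
    gradients or direct access to the training data, which is the property
    black-box post-training methods build on.
    """
    rng = random.Random(seed)
    best = [0.0] * dim          # e.g. a small vector of offsets on frozen weights
    best_score = score(best)
    for _ in range(iterations):
        # Mutate the current best with Gaussian noise (one function evaluation each).
        candidate = [x + rng.gauss(0.0, sigma) for x in best]
        s = score(candidate)
        if s >= best_score:     # greedy acceptance: keep only non-worsening moves
            best, best_score = candidate, s
    return best, best_score
```

On a toy objective such as `lambda v: -sum((x - 1.0) ** 2 for x in v)`, the loop steadily climbs toward the optimum at all-ones using only a few hundred evaluations, mirroring how such methods trade many cheap function evaluations for the absence of gradient access.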
Problem

Research questions and friction points this paper is trying to address.

Address privacy concerns in LLM post-training optimization
Improve scalability of black-box methods for high-dimensional models
Enhance generalization and robustness against data poisoning attacks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evolutionary black-box method for LLM post-training
Implicit compression induces information bottleneck
Lightweight modular enhancement with privacy guarantees