A General Framework for Data-Use Auditing of ML Models

📅 2024-07-21
🏛️ Conference on Computer and Communications Security
📈 Citations: 2
Influential: 0
🤖 AI Summary
To address the copyright and transparency concerns arising from unauthorized use of third-party data in machine-learning training, this paper proposes the first general-purpose, task-agnostic framework for auditing data use in black-box models. Methodologically, it combines any existing black-box membership-inference technique with a custom sequential probability ratio test (SPRT), requiring no assumptions about the downstream task, providing strict control over the false-positive rate (tunable within 0.5%–5%), and generalizing across models. The framework exposes a model-agnostic interface supporting heterogeneous architectures, including image classifiers and multimodal large language models. Extensive experiments on ImageNet classifiers and multimodal foundation models show an average detection accuracy exceeding 92%, with false-positive rates consistently within user-specified thresholds. This work substantially improves the quantifiability and reliability of training-data provenance auditing.

📝 Abstract
Auditing the use of data in training machine-learning (ML) models is an increasingly pressing challenge, as myriad ML practitioners routinely leverage the effort of content creators to train models without their permission. In this paper, we propose a general method to audit an ML model for the use of a data-owner's data in training, without prior knowledge of the ML task for which the data might be used. Our method leverages any existing black-box membership inference method, together with a sequential hypothesis test of our own design, to detect data use with a quantifiable, tunable false-detection rate. We show the effectiveness of our proposed framework by applying it to audit data use in two types of ML models, namely image classifiers and foundation models.
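The abstract's core idea, pairing a black-box membership-inference (MI) method with a sequential hypothesis test, can be sketched with Wald's classic SPRT. This is a minimal illustration under assumed parameters, not the paper's actual test: the binary MI-outcome model and the values of `p0` (MI success rate if the data was not used) and `p1` (if it was) are hypothetical choices here.

```python
import math

def sprt_audit(mi_outcomes, alpha=0.01, beta=0.05, p0=0.5, p1=0.8):
    """Wald's sequential probability ratio test over binary
    membership-inference outcomes, one per audited sample.
    H0: data not used (MI success prob p0); H1: data used (p1).
    alpha bounds the false-detection rate, beta the miss rate.
    Returns 'used', 'not used', or 'undecided'."""
    upper = math.log((1 - beta) / alpha)  # cross above: accept H1
    lower = math.log(beta / (1 - alpha))  # cross below: accept H0
    llr = 0.0  # cumulative log-likelihood ratio
    for success in mi_outcomes:
        if success:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "used"
        if llr <= lower:
            return "not used"
    return "undecided"

# MI succeeds on ~75% of audited samples: evidence of data use.
print(sprt_audit([True, True, True, False] * 10))  # → used
# MI succeeds at chance (~50%): consistent with no data use.
print(sprt_audit([True, False] * 8))               # → not used
```

The sequential design is what makes the false-detection rate tunable: the decision thresholds depend only on `alpha` and `beta`, and the test stops as soon as the accumulated evidence crosses either one, rather than after a fixed number of queries.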
Problem

Research questions and friction points this paper is trying to address.

Machine Learning Model
Data Usage Transparency
Copyright Issues
Innovation

Methods, ideas, or system contributions that make the work stand out.

Universal Detection Method
Black-box Inspection
Error Rate Control
Zonghao Huang
Duke University, Durham, NC, USA
N. Gong
Duke University, Durham, NC, USA
Michael K. Reiter
Duke University, Durham, NC, USA