SelectFormer: Private and Practical Data Selection for Transformers

📅 2023-10-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
In free data markets, selecting high-value training samples before a transaction, under joint model and data privacy constraints, remains a critical challenge. Method: We propose the first efficient private data selection framework tailored for Transformers, featuring (i) an MPC-compatible data selection pipeline, (ii) lightweight MLP-based distillation to approximate high-dimensional nonlinear operators, reducing computational overhead, and (iii) Transformer-aware architecture adaptation with a staged parallel MPC scheduling mechanism. Contribution/Results: Our framework reduces Transformer-based data valuation latency in MPC settings from thousands of hours to tens of hours, incurring only ~0.20% accuracy degradation. It is empirically validated across multiple NLP and CV benchmarks, achieving, for the first time, collaborative high-value sample identification without revealing either the model or the raw data. This provides a scalable, practical solution for privacy-sensitive data trading.
📝 Abstract
Critical to a free data market is $\textit{private data selection}$, i.e. the model owner selects and then appraises training data from the data owner before both parties commit to a transaction. To keep the data and model private, this process shall evaluate the target model to be trained over Multi-Party Computation (MPC). While prior work suggests that evaluating Transformer-based models over MPC is prohibitively expensive, this paper makes it practical for the purpose of data selection. Our contributions are threefold: (1) a new pipeline for private data selection over MPC; (2) emulating high-dimensional nonlinear operators with low-dimension MLPs, which are trained on a small sample of the data of interest; (3) scheduling MPC in a parallel, multiphase fashion. We evaluate our method on diverse Transformer models and NLP/CV benchmarks. Compared to directly evaluating the target model over MPC, our method reduces the delay from thousands of hours to tens of hours, while only seeing around 0.20% accuracy degradation from training with the selected data.
Problem

Research questions and friction points this paper is trying to address.

Private data selection for Transformer models using MPC.
Reducing computational cost of MPC for Transformer evaluation.
Maintaining model accuracy while ensuring data and model privacy.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Private data selection pipeline using MPC
Low-dimension MLPs emulate high-dimensional operators
Parallel, multiphase MPC scheduling reduces delay
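The second innovation above, replacing an MPC-expensive nonlinearity with a small learned MLP, can be sketched in plain numpy. This is an illustrative toy, not the paper's actual procedure: it fits a surrogate for GELU (a typical Transformer nonlinearity that is costly under MPC) using a one-hidden-layer ReLU network whose output layer is solved by least squares. The function names, the tanh-based GELU formula, the input range, and the fixed random hidden layer are all assumptions made for this sketch; the paper trains its MLPs on a small sample of the data of interest.

```python
import numpy as np

rng = np.random.default_rng(0)

def gelu(x):
    # Tanh approximation of GELU -- the kind of nonlinear operator
    # that is expensive to evaluate over MPC.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

# 1. Sample a small set of inputs covering the range activations occupy
#    (the range [-4, 4] is an assumption for this demo).
x = rng.uniform(-4.0, 4.0, size=(512, 1))
y = gelu(x)

# 2. One-hidden-layer MLP surrogate. ReLU is comparatively cheap under MPC
#    (comparisons + multiplications). The hidden layer is fixed at random and
#    only the linear output layer is fitted, via ordinary least squares, so
#    the example stays deterministic and fast.
n_hidden = 64
W1 = rng.choice([-1.0, 1.0], size=(1, n_hidden))   # random slopes
b1 = rng.uniform(-4.0, 4.0, size=n_hidden)          # random breakpoints
H = np.maximum(x @ W1 + b1, 0.0)                    # hidden ReLU features
H_aug = np.hstack([H, np.ones((len(x), 1))])        # append bias column
w_out, *_ = np.linalg.lstsq(H_aug, y, rcond=None)

def mlp_gelu(x):
    # Cheap surrogate: one matmul, one ReLU, one matmul.
    h = np.maximum(x @ W1 + b1, 0.0)
    return np.hstack([h, np.ones((len(x), 1))]) @ w_out

# 3. Check the surrogate on held-out points.
x_test = rng.uniform(-4.0, 4.0, size=(256, 1))
err = np.max(np.abs(mlp_gelu(x_test) - gelu(x_test)))
print(f"max abs error on held-out inputs: {err:.5f}")
```

In a real MPC deployment the surrogate's matmuls and ReLUs would run under the secret-sharing protocol, while the fitting step happens in the clear on a small public or sampled calibration set.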