🤖 AI Summary
In free data markets, selecting high-value training samples before a transaction, while keeping both the model and the data private, remains a critical challenge. Method: We propose the first efficient private data selection framework tailored for Transformers, featuring (i) an MPC-compatible data selection pipeline, (ii) lightweight MLP-based distillation that approximates high-dimensional nonlinear operators to reduce computational overhead, and (iii) Transformer-aware architecture adaptation with a staged, parallel MPC scheduling mechanism. Contribution/Results: Our framework reduces Transformer-based data valuation latency in MPC settings from thousands of hours to tens of hours, while incurring only ~0.20% accuracy degradation. It is empirically validated across multiple NLP and CV benchmarks, achieving, for the first time, collaborative high-value sample identification without revealing either the model or the raw data. This provides a scalable, practical solution for privacy-sensitive data trading.
📝 Abstract
Critical to a free data market is $\textit{private data selection}$, i.e., the model owner selects and then appraises training data from the data owner before both parties commit to a transaction. To keep the data and model private, this process must evaluate the target model to be trained over Multi-Party Computation (MPC). While prior work suggests that evaluating Transformer-based models over MPC is prohibitively expensive, this paper makes it practical for the purpose of data selection. Our contributions are threefold: (1) a new pipeline for private data selection over MPC; (2) emulating high-dimensional nonlinear operators with low-dimensional MLPs, which are trained on a small sample of the data of interest; (3) scheduling MPC in a parallel, multiphase fashion. We evaluate our method on diverse Transformer models and NLP/CV benchmarks. Compared to directly evaluating the target model over MPC, our method reduces the delay from thousands of hours to tens of hours, while incurring only around 0.20% accuracy degradation when training with the selected data.
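To make contribution (2) concrete, the following is a minimal plaintext sketch of the general idea of emulating a nonlinear operator with a small MLP fitted on a small sample. It is not the paper's implementation: the choice of row-wise softmax as the operator, the hidden width of 4, the ReLU activation, the learning rate, and the helper name `fit_mlp_surrogate` are all assumptions made for illustration, and nothing here runs over MPC.

```python
import numpy as np


def softmax(x):
    """Row-wise softmax, used here as the stand-in nonlinear operator."""
    z = x - x.max(axis=1, keepdims=True)  # subtract the row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def fit_mlp_surrogate(x, y, hidden=4, lr=0.5, steps=2000, seed=0):
    """Fit a one-hidden-layer ReLU MLP to map x -> y by full-batch gradient descent.

    Hypothetical helper: returns the learned parameters and the per-step MSE.
    """
    rng = np.random.default_rng(seed)
    d_in, d_out = x.shape[1], y.shape[1]
    w1 = rng.normal(0.0, 0.3, (d_in, hidden))
    b1 = np.zeros(hidden)
    w2 = rng.normal(0.0, 0.3, (hidden, d_out))
    b2 = np.zeros(d_out)
    losses = []
    for _ in range(steps):
        # Forward pass through the bottleneck MLP.
        pre = x @ w1 + b1
        h = np.maximum(pre, 0.0)
        pred = h @ w2 + b2
        err = pred - y
        losses.append(float(np.mean(err ** 2)))
        # Backward pass for the mean-squared-error loss.
        g_pred = 2.0 * err / err.size
        g_w2 = h.T @ g_pred
        g_b2 = g_pred.sum(axis=0)
        g_pre = (g_pred @ w2.T) * (pre > 0)
        g_w1 = x.T @ g_pre
        g_b1 = g_pre.sum(axis=0)
        # Plain gradient-descent update.
        w1 -= lr * g_w1
        b1 -= lr * g_b1
        w2 -= lr * g_w2
        b2 -= lr * g_b2
    return (w1, b1, w2, b2), losses


if __name__ == "__main__":
    # A small synthetic sample stands in for "a small sample of the data of interest".
    sample = np.random.default_rng(1).normal(size=(256, 8))
    _, history = fit_mlp_surrogate(sample, softmax(sample))
    print(f"MSE before: {history[0]:.5f}, after: {history[-1]:.5f}")
```

The surrogate consists only of matrix multiplications, additions, and a ReLU, which are generally cheaper to evaluate under secret sharing than an exponential followed by a division. How close such a surrogate needs to be for data selection, and which operators are replaced, is specific to the paper and not captured by this toy example.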