Data Auctions for Retrieval Augmented Generation

📅 2025-08-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper studies data selling for retrieval-augmented generation (RAG) in generative AI applications. Each data point can be allocated to at most one buyer, buyers' valuations are coverage-based, and the setting is motivated by data control and prior-free revenue maximization. Welfare maximization under these constraints is NP-hard even with two bidders, and the paper gives a polynomial-time (1−1/e)-approximation algorithm for any number of bidders. That allocation algorithm alone is not incentive compatible, so the key contribution is a “data burning” post-processing step, which discards certain allocated data points to remove incentives for bid manipulation. The step retains the (1−1/e) approximation factor while achieving incentive compatibility. Experiments on synthetic and real-world image and text datasets show that the mechanism is effective in practice compared with popular baseline algorithms for combinatorial auctions.

📝 Abstract

We study the problem of data selling for Retrieval Augmented Generation (RAG) tasks in Generative AI applications. We model each buyer's valuation of a dataset with a natural coverage-based valuation function that increases with the inclusion of more relevant data points that would enhance responses to anticipated queries. Motivated by issues such as data control and prior-free revenue maximization, we focus on the scenario where each data point can be allocated to only one buyer. We show that the problem of welfare maximization in this setting is NP-hard even with two bidders, but design a polynomial-time $(1-1/e)$-approximation algorithm for any number of bidders. However, this efficient allocation algorithm fails to be incentive compatible. The crux of our approach is a carefully tailored post-processing step called “data burning”, which retains the $(1-1/e)$ approximation factor but achieves incentive compatibility. Our thorough experiments on synthetic and real-world image and text datasets demonstrate the practical effectiveness of our algorithm compared to popular baseline algorithms for combinatorial auctions.
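
To make the setting concrete, here is a minimal sketch of a coverage-style valuation and a simple greedy allocation in which each data point goes to at most one buyer. Everything here is an illustrative assumption, not the paper's construction: the abstract does not give the exact valuation formula, so the sketch assumes each buyer has weighted anticipated queries and that a query counts as covered once the buyer holds at least one relevant point. The names `coverage_value` and `greedy_allocate` are hypothetical. The greedy rule is a stand-in for the paper's algorithm: greedy of this kind is generally known to guarantee only a 1/2 approximation for monotone submodular valuations, and it includes no data burning step, so it makes no incentive-compatibility claim.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

# Assumed model: a buyer is a list of (query_weight, relevant_point_ids) pairs.
Buyer = List[Tuple[float, FrozenSet[int]]]


def coverage_value(buyer: Buyer, owned: Set[int]) -> float:
    """Sum the weights of queries that have at least one relevant point in `owned`."""
    return sum(weight for weight, relevant in buyer if relevant & owned)


def greedy_allocate(buyers: Dict[str, Buyer], points: List[int]) -> Dict[str, Set[int]]:
    """Assign each point to the buyer with the largest positive marginal gain.

    Each point is given to at most one buyer; points nobody values stay unallocated.
    """
    alloc: Dict[str, Set[int]] = {name: set() for name in buyers}
    for point in points:
        best_name, best_gain = None, 0.0
        for name, buyer in buyers.items():
            before = coverage_value(buyer, alloc[name])
            after = coverage_value(buyer, alloc[name] | {point})
            if after - before > best_gain:
                best_name, best_gain = name, after - before
        if best_name is not None:
            alloc[best_name].add(point)
    return alloc


# Toy instance (made-up numbers): two buyers competing for three data points.
buyers = {
    "a": [(3.0, frozenset({1, 2})), (1.0, frozenset({3}))],
    "b": [(2.0, frozenset({2})), (2.0, frozenset({3}))],
}
alloc = greedy_allocate(buyers, [1, 2, 3])
welfare = sum(coverage_value(buyers[name], alloc[name]) for name in buyers)
print(alloc, welfare)
```

In this toy instance, buyer "a" stops gaining from point 2 once it holds point 1, because the same query is already covered. That diminishing-returns behavior is what makes coverage valuations submodular.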

Problem

Research questions and friction points this paper is trying to address.

Optimizing data allocation for RAG tasks

Maximizing welfare with single-buyer data constraints

Achieving incentive compatibility through data burning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Data burning post-processing for incentive compatibility

Coverage-based valuation function for dataset pricing

Polynomial-time approximation algorithm for welfare maximization
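
As a rough formalization of the welfare-maximization problem named above, one way to write it down is shown below. The notation is an assumption made for illustration and is not taken from the paper: $D$ is the set of data points, $n$ the number of buyers, $Q_i$ buyer $i$'s anticipated queries, $w_{iq} \ge 0$ the weight of query $q$, and $R_{iq} \subseteq D$ the points relevant to it.

$$
v_i(S) = \sum_{q \in Q_i} w_{iq} \cdot \mathbf{1}\left[ S \cap R_{iq} \neq \emptyset \right],
\qquad
\max_{S_1, \dots, S_n \subseteq D} \; \sum_{i=1}^{n} v_i(S_i)
\quad \text{s.t.} \quad S_i \cap S_j = \emptyset \;\; \text{for all } i \neq j.
$$

Under this assumed form, each $v_i$ is a weighted coverage function and therefore monotone and submodular. The disjointness constraint encodes the rule that each data point is allocated to at most one buyer.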

🔎 Similar Papers

Mitigating the Privacy Issues in Retrieval-Augmented Generation (RAG) via Pure Synthetic Data