🤖 AI Summary
This paper addresses the high cost and low efficiency of acquiring labeled data in resource-constrained settings. We propose an active-learning-driven market mechanism that formulates label procurement as an optimization problem with a budget constraint and a performance improvement threshold. Within a single-buyer, multiple-seller framework, we jointly design market clearing, two active learning strategies (variance-based and committee-based query selection), and a differentiated pricing scheme. Evaluated on real-world real estate price prediction and energy demand forecasting tasks, our method achieves higher model performance with fewer labels than random sampling, while remaining robust to label noise and distributional shift. Our core contribution is what we believe to be the first data procurement paradigm unifying active learning with microeconomic market mechanisms, enabling Pareto-optimal trade-offs between annotation cost and model utility.
📝 Abstract
We introduce and analyse active learning markets as a way to purchase labels in situations where analysts aim to acquire additional data to improve model fitting or to better train models for predictive analytics applications. This contrasts with the many existing proposals for purchasing features and examples. In an original formulation, we cast market clearing as an optimisation problem that integrates budget constraints and improvement thresholds into the label acquisition process. We focus on a single-buyer, multiple-seller setup and propose two active learning strategies (variance-based and query-by-committee-based), each paired with a distinct pricing mechanism. Both are compared against a random sampling benchmark. The proposed strategies are validated on real-world datasets from two critical application domains: real estate pricing and energy forecasting. The results demonstrate the robustness of our approach, which consistently achieves superior performance with fewer acquired labels than conventional methods. Our proposal offers an easy-to-implement, practical solution for optimising data acquisition in resource-constrained environments.
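To make the idea of budget-constrained label purchasing concrete, the following is a minimal illustrative sketch and not the paper's actual mechanism. It assumes a linear regression committee trained on bootstrap resamples, uses prediction variance across the committee as the informativeness score, and clears the market with a simple greedy score-per-price rule. The function names `committee_disagreement` and `clear_market`, the `min_score` parameter (a loose stand-in for an improvement threshold), and the greedy rule itself are all assumptions introduced here for illustration.

```python
import numpy as np


def committee_disagreement(X_lab, y_lab, X_pool, n_members=10, seed=0):
    """Score each pool point by the variance of committee predictions.

    Assumption: the committee consists of least-squares models fitted on
    bootstrap resamples of the labelled data.
    """
    rng = np.random.default_rng(seed)
    n = X_lab.shape[0]
    preds = []
    for _ in range(n_members):
        idx = rng.integers(0, n, size=n)  # bootstrap resample with replacement
        coef, *_ = np.linalg.lstsq(X_lab[idx], y_lab[idx], rcond=None)
        preds.append(X_pool @ coef)
    # Variance across committee members, one value per pool point
    return np.var(np.stack(preds), axis=0)


def clear_market(scores, prices, budget, min_score=0.0):
    """Greedy clearing (hypothetical rule): buy labels in decreasing order of
    score per unit price, skipping offers that fall below min_score or that
    would exceed the remaining budget."""
    scores = np.asarray(scores, dtype=float)
    prices = np.asarray(prices, dtype=float)
    order = np.argsort(-scores / prices)
    bought, spent = [], 0.0
    for i in order:
        if scores[i] < min_score:
            continue
        if spent + prices[i] <= budget:
            bought.append(int(i))
            spent += float(prices[i])
    return bought, spent
```

In this sketch, each seller's offer is reduced to a single asking price per label, and the buyer simply stops once no remaining offer fits the budget. A greedy ratio rule is only a heuristic for the underlying knapsack-style problem; an exact optimisation formulation, as the abstract describes, would generally be solved differently.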