🤖 AI Summary
To address inefficient data acquisition in high-cost scenarios, this paper proposes a metadata-driven targeted data collection method. It uses metadata from the training data (such as season, time of day, and location) to construct a Gaussian process (GP) surrogate model of the target model's performance response surface, which shows where in metadata space the model is deficient and guides the acquisition policy. This work is the first to integrate GP-based response surface modeling, commonly used in computer experiments, into a meta-learning framework, thereby departing from conventional random or uncertainty-driven paradigms. Evaluated on an aircraft detection task using aerial imagery, the method achieves a 12.7% higher mAP than random sampling under identical data budgets. Moreover, it attains the final performance of full random sampling using only 30% additional data, demonstrating substantial gains in data efficiency.
📝 Abstract
Collecting operationally realistic data to inform machine learning models can be costly. Before collecting new data, it is helpful to understand where a model is deficient. For example, object detectors trained on images of rare objects may not be good at identification in poorly represented conditions. We offer a way of informing subsequent data acquisition to maximize model performance by leveraging the toolkit of computer experiments and metadata describing the circumstances under which the training data was collected (e.g., season, time of day, location). We do this by evaluating the learner as the training data is varied according to its metadata. A Gaussian process (GP) surrogate fit to that response surface can inform new data acquisitions. This meta-learning approach improves learner performance compared with acquiring data whose metadata is selected at random, which we illustrate on both classic learning examples and a motivating application involving the collection of aerial images in search of airplanes.
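The loop the abstract describes (vary the training data by metadata, score the learner, fit a GP to the resulting response surface, then use the fit to pick what to collect next) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the single metadata variable `winter_fraction`, the stand-in `toy_learner_score`, the RBF kernel settings, and the rule of picking the highest posterior mean are all hypothetical choices made here for the example.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=0.2, variance=1.0):
    """Squared-exponential kernel between two 1-D arrays of inputs."""
    diff = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (diff / length_scale) ** 2)


def gp_posterior(x_train, y_train, x_test, noise=1e-4):
    """Posterior mean and standard deviation of a GP fit to centred scores."""
    y_mean = y_train.mean()
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_test)
    L = np.linalg.cholesky(K)
    # Solve K alpha = (y - mean) using the Cholesky factor.
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train - y_mean))
    mean = y_mean + K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    prior_var = rbf_kernel(x_test, x_test).diagonal()
    var = np.clip(prior_var - np.sum(v ** 2, axis=0), 0.0, None)
    return mean, np.sqrt(var)


def toy_learner_score(winter_fraction):
    """Hypothetical stand-in for 'train the learner on this mix and evaluate it'."""
    return 0.8 - (winter_fraction - 0.6) ** 2


# Design: training sets whose share of winter images is varied (assumed metadata).
x_obs = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
y_obs = toy_learner_score(x_obs)

# Fit the surrogate and predict the response surface on a dense grid.
grid = np.linspace(0.0, 1.0, 101)
mean, std = gp_posterior(x_obs, y_obs, grid)

# Assumed acquisition rule: collect the metadata mix with the best predicted score.
best_fraction = grid[np.argmax(mean)]
print(f"suggested winter fraction: {best_fraction:.2f}")
```

In a real setting each call to the scoring function would be a full train-and-evaluate run, which is why a cheap surrogate is useful. The posterior standard deviation `std` is returned so that an acquisition rule could also favour poorly explored metadata settings instead of relying on the predicted mean alone.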