🤖 AI Summary
This paper addresses how to assess whether high-dimensional survey data credibly represents its population, and proposes a task-oriented credibility test. Conventional validation relies on comparing high-dimensional distributions or on reconstructing the full model, and both approaches suffer from the curse of dimensionality. The proposed method instead defines a model-specific distance metric tailored to regression tasks, which decouples sample complexity from data dimensionality. On the theory side, the authors prove the algorithm's correctness and show that its sample complexity is independent of the data dimension, whereas verification by reconstructing the regression model requires a number of samples that scales linearly with the dimension. Numerical experiments illustrate the algorithm's performance. The key idea is to anchor validation to the conclusions of the downstream analysis rather than to distributional fidelity, using a dimension-independent distance criterion. This makes quality assessment of high-dimensional survey data more practical and scalable.
📝 Abstract
Assessing whether a sample survey credibly represents the population is a critical question for ensuring the validity of downstream research. Generally, this problem reduces to estimating the distance between two high-dimensional distributions, which typically requires a number of samples that grows exponentially with the dimension. However, depending on the model used for data analysis, the conclusions drawn from the data may remain consistent across different underlying distributions. In this context, we propose a task-based approach to assess the credibility of sample surveys. Specifically, we introduce a model-specific distance metric to quantify this notion of credibility. We also design an algorithm to verify the credibility of survey data in the context of regression models. Notably, the sample complexity of our algorithm is independent of the data dimension. This efficiency stems from the fact that the algorithm focuses on verifying the credibility of the survey data rather than reconstructing the underlying regression model. Furthermore, we show that if one attempts to verify credibility by reconstructing the regression model, the sample complexity scales linearly with the dimensionality of the data. We prove the correctness of our algorithm and demonstrate its performance numerically.
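The observation that conclusions can stay the same across different underlying distributions can be illustrated with a small simulation. The sketch below is not the paper's verification algorithm or its distance metric. It is a hypothetical toy setup (the coefficients `w_true`, the sample size, and both covariate distributions are assumptions chosen for illustration) in which a "population" and a "survey" have clearly different covariate distributions yet lead to nearly identical least-squares coefficients. Note that it fits the regression on both samples, which is exactly the reconstruction route the abstract describes as scaling with the dimension; it is used here only to make the motivation concrete.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 20000
# Hypothetical regression coefficients shared by population and survey.
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])

def fit_ols(X, y):
    """Ordinary least squares without an intercept."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# "Population": standard normal covariates.
X_pop = rng.normal(size=(n, d))
y_pop = X_pop @ w_true + rng.normal(scale=0.1, size=n)

# "Survey": covariates from a visibly different distribution (uniform on [0, 4]).
X_svy = rng.uniform(0.0, 4.0, size=(n, d))
y_svy = X_svy @ w_true + rng.normal(scale=0.1, size=n)

w_pop = fit_ols(X_pop, y_pop)
w_svy = fit_ols(X_svy, y_svy)

# The covariate distributions differ a lot (largest gap between feature means)...
mean_gap = float(np.abs(X_pop.mean(axis=0) - X_svy.mean(axis=0)).max())
# ...but the fitted regression models nearly coincide.
coef_gap = float(np.linalg.norm(w_pop - w_svy))

print(f"max gap in feature means: {mean_gap:.3f}")
print(f"distance between fitted coefficients: {coef_gap:.5f}")
```

In this toy setting a distribution-level comparison would flag the survey as unrepresentative, while a comparison at the level of the regression task would not, which is the distinction the task-based notion of credibility is meant to capture.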