🤖 AI Summary
Estimating rare population parameters—such as annual hate crime counts—from highly imbalanced textual data remains challenging due to scarcity of labeled instances and prohibitive annotation costs. To address this, we propose a prediction-augmented sampling estimation framework that integrates pre-trained Transformer model predictions as auxiliary variables with classical survey estimators—including Hansen–Hurwitz, difference estimation, and stratified random sampling. This work constitutes the first rigorous unification of deep learning prediction and sampling theory, preserving estimator unbiasedness while substantially improving statistical efficiency. Empirical evaluation on Swedish police textual reports demonstrates accurate estimation of both annual hate crime totals and underreporting rates. The method reduces annotation effort significantly and achieves over 40% lower estimation variance compared to conventional sampling approaches.
📝 Abstract
Estimating population parameters in finite populations of text documents can be challenging when obtaining the labels for the target variable requires manual annotation. To address this problem, we combine predictions from a transformer encoder neural network with well-established survey sampling estimators using the model predictions as an auxiliary variable. The applicability is demonstrated in Swedish hate crime statistics based on Swedish police reports. Estimates of the yearly number of hate crimes and the police's under-reporting are derived using the Hansen-Hurwitz estimator, difference estimation, and stratified random sampling estimation. We conclude that if labeled training data is available, the proposed method can provide very efficient estimates with reduced time spent on manual annotation.