Prediction-powered estimators for finite population statistics in highly imbalanced textual data: Public hate crime estimation

📅 2025-05-05

📈 Citations: 0

✨ Influential: 0

career value

201K/year

🤖 AI Summary

Estimating rare population parameters—such as annual hate crime counts—from highly imbalanced textual data remains challenging due to scarcity of labeled instances and prohibitive annotation costs. To address this, we propose a prediction-augmented sampling estimation framework that integrates pre-trained Transformer model predictions as auxiliary variables with classical survey estimators—including Hansen–Hurwitz, difference estimation, and stratified random sampling. This work constitutes the first rigorous unification of deep learning prediction and sampling theory, preserving estimator unbiasedness while substantially improving statistical efficiency. Empirical evaluation on Swedish police textual reports demonstrates accurate estimation of both annual hate crime totals and underreporting rates. The method reduces annotation effort significantly and achieves over 40% lower estimation variance compared to conventional sampling approaches.

Technology Category

Application Category

📝 Abstract

Estimating population parameters in finite populations of text documents can be challenging when obtaining the labels for the target variable requires manual annotation. To address this problem, we combine predictions from a transformer encoder neural network with well-established survey sampling estimators using the model predictions as an auxiliary variable. The applicability is demonstrated in Swedish hate crime statistics based on Swedish police reports. Estimates of the yearly number of hate crimes and the police's under-reporting are derived using the Hansen-Hurwitz estimator, difference estimation, and stratified random sampling estimation. We conclude that if labeled training data is available, the proposed method can provide very efficient estimates with reduced time spent on manual annotation.

Problem

Research questions and friction points this paper is trying to address.

Estimating hate crime statistics from imbalanced text data

Reducing manual annotation for text population parameters

Combining neural predictions with survey sampling estimators

Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer encoder neural network predictions

Survey sampling estimators integration

Stratified random sampling estimation

🔎 Similar Papers

No similar papers found.