Prediction-powered estimators for finite population statistics in highly imbalanced textual data: Public hate crime estimation

📅 2025-05-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Estimating rare population parameters—such as annual hate crime counts—from highly imbalanced textual data remains challenging due to scarcity of labeled instances and prohibitive annotation costs. To address this, we propose a prediction-augmented sampling estimation framework that integrates pre-trained Transformer model predictions as auxiliary variables with classical survey estimators—including Hansen–Hurwitz, difference estimation, and stratified random sampling. This work constitutes the first rigorous unification of deep learning prediction and sampling theory, preserving estimator unbiasedness while substantially improving statistical efficiency. Empirical evaluation on Swedish police textual reports demonstrates accurate estimation of both annual hate crime totals and underreporting rates. The method reduces annotation effort significantly and achieves over 40% lower estimation variance compared to conventional sampling approaches.

📝 Abstract
Estimating population parameters in finite populations of text documents can be challenging when obtaining the labels for the target variable requires manual annotation. To address this problem, we combine predictions from a transformer encoder neural network with well-established survey sampling estimators using the model predictions as an auxiliary variable. The applicability is demonstrated in Swedish hate crime statistics based on Swedish police reports. Estimates of the yearly number of hate crimes and the police's under-reporting are derived using the Hansen-Hurwitz estimator, difference estimation, and stratified random sampling estimation. We conclude that if labeled training data is available, the proposed method can provide very efficient estimates with reduced time spent on manual annotation.
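To make the idea of using model predictions as an auxiliary variable concrete, here is a minimal sketch of difference estimation. This page does not give the paper's formula, so the textbook form under simple random sampling without replacement is assumed: the sum of predictions over the whole population, plus a sample-based correction for the prediction error. The synthetic data, the 2% positive rate, and the name `difference_estimate` are illustrative assumptions, not taken from the paper.

```python
import random


def difference_estimate(predictions, sample_idx, sample_labels):
    """Difference estimator of a population total (assumed textbook form).

    predictions   : model score for every document in the population
    sample_idx    : indices of the documents that were manually annotated
    sample_labels : true labels for those documents, in the same order
    """
    N = len(predictions)
    n = len(sample_idx)
    prediction_total = sum(predictions)
    # Mean prediction error (label minus score) observed in the annotated sample.
    mean_residual = sum(
        y - predictions[i] for i, y in zip(sample_idx, sample_labels)
    ) / n
    return prediction_total + N * mean_residual


# Synthetic, highly imbalanced population (assumption: about 2% positives).
rng = random.Random(0)
N = 10_000
labels = [1 if rng.random() < 0.02 else 0 for _ in range(N)]
# Hypothetical classifier scores that track the labels imperfectly.
predictions = [0.9 * y + rng.uniform(0.0, 0.05) for y in labels]

# Annotate a simple random sample of 500 documents.
sample_idx = rng.sample(range(N), 500)
sample_labels = [labels[i] for i in sample_idx]

estimate = difference_estimate(predictions, sample_idx, sample_labels)
print(f"true total: {sum(labels)}, estimate: {estimate:.1f}")
```

The correction term is what keeps the estimator unbiased under this design even when the classifier is systematically off. The better the predictions track the labels, the smaller the residuals and the lower the variance.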
Problem

Research questions and friction points this paper is trying to address.

Estimating hate crime statistics from imbalanced text data
Reducing manual annotation for text population parameters
Combining neural predictions with survey sampling estimators
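As one illustration of combining predictions with a survey estimator, the sketch below applies the Hansen-Hurwitz estimator to a sample drawn with replacement, with draw probabilities proportional to a model score. How the paper actually sets the draw probabilities is not stated on this page, so the proportional-to-score design, the synthetic data, and the name `hansen_hurwitz_total` are assumptions.

```python
import random


def hansen_hurwitz_total(sample_labels, draw_probs):
    """Hansen-Hurwitz estimator of a population total.

    sample_labels : labels of the n units drawn with replacement
    draw_probs    : per-draw selection probability of each drawn unit
    """
    n = len(sample_labels)
    return sum(y / p for y, p in zip(sample_labels, draw_probs)) / n


# Synthetic imbalanced population (assumption: about 2% positives).
rng = random.Random(1)
N = 5_000
labels = [1 if rng.random() < 0.02 else 0 for _ in range(N)]
# Hypothetical model scores; kept strictly positive so every unit can be drawn.
scores = [0.8 * y + 0.01 + rng.uniform(0.0, 0.05) for y in labels]

# Draw probabilities proportional to the score (assumed design).
score_total = sum(scores)
probs = [s / score_total for s in scores]

# Draw 300 units with replacement and "annotate" them.
drawn = rng.choices(range(N), weights=scores, k=300)
estimate = hansen_hurwitz_total(
    [labels[i] for i in drawn], [probs[i] for i in drawn]
)
print(f"true total: {sum(labels)}, estimate: {estimate:.1f}")
```

Because likely positives are drawn far more often than under equal-probability sampling, the annotated sample contains many more rare-class documents, while dividing by the draw probability keeps the estimator unbiased.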
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer encoder predictions used as an auxiliary variable
Integration of model predictions with established survey sampling estimators
Hansen-Hurwitz, difference, and stratified random sampling estimation
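A minimal sketch of stratified random sampling estimation follows, assuming the strata are formed by thresholding the model score. The two-stratum design, the threshold of 0.5, the per-stratum annotation budgets, and the name `stratified_total` are hypothetical choices for illustration; the paper's actual stratification is not described on this page.

```python
import random


def stratified_total(strata_sizes, strata_samples):
    """Stratified estimator of a population total: sum over strata of N_h * mean_h."""
    total = 0.0
    for N_h, ys in zip(strata_sizes, strata_samples):
        if N_h == 0:
            continue  # an empty stratum contributes nothing
        total += N_h * sum(ys) / len(ys)
    return total


# Synthetic imbalanced population (assumption: about 2% positives).
rng = random.Random(2)
N = 5_000
labels = [1 if rng.random() < 0.02 else 0 for _ in range(N)]
# Hypothetical scores: informative for most documents, pure noise for about 5%.
scores = [rng.random() if rng.random() < 0.05 else 0.9 * y + 0.05 for y in labels]

# Two strata split on a hypothetical score threshold.
threshold = 0.5
strata = [
    [i for i in range(N) if scores[i] < threshold],
    [i for i in range(N) if scores[i] >= threshold],
]
budget = [200, 50]  # annotation budget per stratum (assumption)

samples = []
for units, n_h in zip(strata, budget):
    # Simple random sample without replacement within the stratum.
    chosen = rng.sample(units, min(n_h, len(units)))
    samples.append([labels[i] for i in chosen])

estimate = stratified_total([len(u) for u in strata], samples)
print(f"true total: {sum(labels)}, estimate: {estimate:.1f}")
```

When the scores separate the classes well, each stratum is nearly homogeneous, so the within-stratum variance, and with it the variance of the total, becomes small.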
Authors
Hannes Waldetoft
Department of Statistics, Uppsala University
Jakob Torgander
Department of Statistics, Uppsala University
Måns Magnusson
Department of Statistics, Uppsala University, Sweden
Bayesian Statistics, Probabilistic Machine Learning, Text-as-Data, Computational Social Science