Socio-Conformal Calibration in Complex Survey Data: Marginal Validity Is Not Enough for Subgroup Reliability

📅 2026-05-06

🤖 AI Summary

This study asks whether machine-learning uncertainty estimates used in social survey applications are reliable within demographic subgroups, not just valid marginally. Focusing on ordinal conformal prediction under complex sampling designs, it compares subgroup coverage and predictive efficiency (prediction-set size) for standard split conformal, Mondrian, and regularized Mondrian calibration. Across 100 respondent-disjoint splits of Pew Research Center's American Trends Panel (Wave 152), standard conformal achieves nominal marginal coverage but leaves survey-weighted subgroup coverage gaps of roughly 13 percentage points. For the strongest predictor (XGBoost), Mondrian calibration worsens the fairness-efficiency trade-off: prediction sets grow and the subgroup gap widens. A regularized comparator that shrinks group thresholds toward the global quantile mitigates this instability but yields no decisive fairness gain. The findings indicate that marginal validity alone is insufficient for subgroup reliability and that naive group-specific calibration is not a dependable fairness remedy in complex survey settings.

📝 Abstract

Machine-learning systems used in survey-based social measurement require uncertainty estimates that are reliable across population subgroups, not merely valid in aggregate. We study ordinal conformal prediction for five-level AI-attitude forecasting on the Pew American Trends Panel (Wave 152; n = 4,591; 12 race × education subgroups), comparing standard split conformal, Mondrian (group-specific) conformal, and a regularized Mondrian comparator across 100 respondent-disjoint splits with survey-weighted evaluation. Standard conformal achieves nominal marginal coverage for all four base predictors but leaves weighted subgroup gaps of ~13 percentage points. For the strongest predictor (XGBoost), Mondrian worsens the fairness-efficiency trade-off: weighted set size rises by +0.036 (d_z = 1.66) while the weighted subgroup gap grows by +0.013 (d_z = 0.30). A regularized comparator that shrinks group thresholds toward the global quantile mitigates this instability (Δgap = -0.001, Δsize = +0.012) but does not yield a decisive fairness gain. Failure analysis traces the mechanism to calibration-cell fragmentation interacting with group-specific confidence mismatch. The negative result persists across alternate outcome codings and subgroup granularities, demonstrating that nominal marginal validity is insufficient for subgroup reliability and that naive group-specific calibration is not a dependable fairness remedy in complex survey settings.
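
To make the three calibration schemes concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a simple nonconformity score of 1 minus the predicted probability of the true class (the abstract does not state which ordinal score is used), a linear shrinkage of each group threshold toward the global quantile controlled by a hypothetical `shrink` parameter, and a subgroup gap defined as the maximum minus the minimum survey-weighted coverage across groups. All function names are illustrative.

```python
import numpy as np


def conformal_quantile(scores, alpha):
    """Split-conformal threshold: the ceil((n+1)(1-alpha))-th smallest score."""
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        return np.inf  # too few calibration points for a finite threshold
    return float(np.sort(scores)[k - 1])


def mondrian_thresholds(scores, groups, alpha, shrink=0.0):
    """Per-group thresholds. shrink=0 is plain Mondrian, shrink=1 recovers the
    global (standard) threshold. The linear shrinkage is an assumed form."""
    q_global = conformal_quantile(scores, alpha)
    if not np.isfinite(q_global):
        raise ValueError("calibration set too small for the requested alpha")
    thresholds = {}
    for g in np.unique(groups):
        q_g = conformal_quantile(scores[groups == g], alpha)
        if not np.isfinite(q_g):
            q_g = q_global  # assumed fallback for tiny calibration cells
        thresholds[g] = q_g + shrink * (q_global - q_g)
    return thresholds


def prediction_sets(probs, groups, thresholds):
    """Boolean (n, K) mask: class k is in the set when 1 - p_k <= group threshold."""
    q = np.array([thresholds[g] for g in groups])
    return (1.0 - probs) <= q[:, None]


def weighted_subgroup_gap(covered, weights, groups):
    """Max minus min survey-weighted coverage across subgroups."""
    cov = [
        np.average(covered[groups == g], weights=weights[groups == g])
        for g in np.unique(groups)
    ]
    return float(max(cov) - min(cov))


if __name__ == "__main__":
    # Synthetic stand-in data; none of these numbers come from the paper.
    rng = np.random.default_rng(0)
    n, K = 600, 5
    probs = rng.dirichlet(np.ones(K), size=n)
    y = np.array([rng.choice(K, p=p) for p in probs])
    groups = rng.integers(0, 12, size=n)
    weights = rng.uniform(0.5, 2.0, size=n)
    cal, test = np.arange(n) < 300, np.arange(n) >= 300

    scores = 1.0 - probs[cal, y[cal]]
    for shrink in (1.0, 0.0, 0.5):  # standard, Mondrian, regularized
        th = mondrian_thresholds(scores, groups[cal], alpha=0.1, shrink=shrink)
        sets = prediction_sets(probs[test], groups[test], th)
        covered = sets[np.arange(sets.shape[0]), y[test]].astype(float)
        gap = weighted_subgroup_gap(covered, weights[test], groups[test])
        size = np.average(sets.sum(axis=1), weights=weights[test])
        print(f"shrink={shrink}: gap={gap:.3f}, weighted size={size:.3f}")
```

With 12 subgroups, each Mondrian threshold is estimated from a fraction of the calibration data, which illustrates the calibration-cell fragmentation that the abstract identifies as part of the failure mechanism. Shrinking toward the global quantile reduces that variance, but it also gives up the exact per-group finite-sample guarantee of plain Mondrian calibration.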

Problem

Research questions and friction points this paper is trying to address.

conformal prediction

subgroup reliability

survey data

fairness

calibration

Innovation

Methods, ideas, or system contributions that make the work stand out.

conformal prediction

subgroup reliability

survey-weighted evaluation