Two Point Correlation Function Estimation with Contaminated Data

📅 2026-03-11

🤖 AI Summary

Imperfect target selection and spatially structured contamination in imaging surveys can severely bias estimates of the two-point correlation function (2PCF). To address this, the work proposes the prediction-powered Landy–Szalay (PP–LS) estimator, which combines noisy inclusion labels from the full photometric catalog with true labels from a small, high-fidelity spectroscopic subsample. PP–LS debiases the pair counts with residual-based, design-weighted corrections computed only on the labeled subsample, and requires no probability calibration, known misclassification rates, or explicit contamination model. The method preserves the standard random-catalog normalization and, when the labeled subsample is drawn by simple random sampling, recovers the oracle (true-label) pair counts and is therefore consistent for the target 2PCF. In simulations with clustered and spatially structured contaminants, PP–LS removes the bias of naive catalog-level estimators while achieving substantially lower variance than estimators that use the spectroscopic sample alone, and it integrates directly with standard pair-counting pipelines.

📝 Abstract

The two-point correlation function (2PCF) is a cornerstone of precision cosmology, yet its estimation from imaging surveys is vulnerable to contamination and incompleteness arising from imperfect target selection and pipeline-level inclusion decisions. In practice, the scientific target is a physically defined population, while the working catalog is constructed from noisy measurements and selection cuts, leading to mismatches between true and observed inclusion. These errors are often spatially structured, correlating with survey depth, observing conditions, and foregrounds, and can imprint spurious large-scale power or suppress the true clustering signal. High-resolution spectroscopic samples provide gold-standard inclusion in the target population but are typically available for only a small subset of objects. We introduce a prediction-powered Landy–Szalay (PP–LS) estimator that combines noisy inclusion labels across the full catalog with exact labels on a small spectroscopic subset while preserving the standard random-catalog normalization for survey geometry and selection. PP–LS debiases pair counts using residual-based, design-weighted corrections computed only on the labeled subset, requiring no probability calibration, known misclassification rates, or explicit modeling of contamination. Under simple random sampling of the labeled subset, we establish recovery of the oracle (true-label) Landy–Szalay pair counts and thus consistency for the target 2PCF. In simulations with clustered and spatially structured contaminants, PP–LS removes the bias of naive catalog-level estimators while achieving substantially lower variance than spectroscopic-only clustering. The resulting estimator is statistically principled, computationally lightweight, and integrates directly with standard pair-counting pipelines, enabling robust clustering inference in next-generation surveys.
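The idea of correcting pair counts with residuals from a labeled subset can be illustrated with a minimal Python sketch. This is not the paper's implementation: the function names (`pair_mask`, `weighted_auto_pairs`, `pp_counts`, `landy_szalay`), the brute-force distance matrix, the single separation bin, and the plug-in use of the estimated target count in the normalization are all assumptions made for illustration. The sketch assumes binary inclusion labels and a labeled subset drawn by simple random sampling without replacement, so that single objects are reweighted by N/n and pairs by N(N-1)/(n(n-1)).

```python
import numpy as np

def pair_mask(pos_a, pos_b, r_lo, r_hi):
    """Boolean matrix, True where the separation lies in [r_lo, r_hi)."""
    d = np.linalg.norm(pos_a[:, None, :] - pos_b[None, :, :], axis=-1)
    return (d >= r_lo) & (d < r_hi)

def weighted_auto_pairs(pos, w, r_lo, r_hi):
    """Sum of w_i * w_j over unordered pairs i < j that fall in the bin."""
    k = np.triu(pair_mask(pos, pos, r_lo, r_hi), k=1).astype(float)
    return float(w @ k @ w)

def pp_counts(pos, y_hat, labeled_idx, y_true_labeled, rand, r_lo, r_hi):
    """Residual-corrected DD, DR, and target count (hypothetical helper).

    pos            : (N, d) positions of all catalog objects
    y_hat          : (N,) noisy inclusion labels in {0, 1}
    labeled_idx    : indices of the spectroscopically labeled subset
    y_true_labeled : true inclusion labels for that subset
    rand           : (N_R, d) random-catalog positions
    """
    y_hat = np.asarray(y_hat, dtype=float)
    y_s = np.asarray(y_true_labeled, dtype=float)
    yh_s = y_hat[labeled_idx]
    pos_s = pos[labeled_idx]
    n_all, n_lab = len(pos), len(labeled_idx)

    # Inverse inclusion probabilities under simple random sampling
    # without replacement (an assumption of this sketch).
    w1 = n_all / n_lab
    w2 = n_all * (n_all - 1) / (n_lab * (n_lab - 1))

    # DD: noisy-label count on the full catalog plus a reweighted
    # residual (true minus noisy) computed on labeled pairs only.
    dd = weighted_auto_pairs(pos, y_hat, r_lo, r_hi) + w2 * (
        weighted_auto_pairs(pos_s, y_s, r_lo, r_hi)
        - weighted_auto_pairs(pos_s, yh_s, r_lo, r_hi)
    )

    # DR: per-object number of randoms in the bin, corrected the same way.
    c = pair_mask(pos, rand, r_lo, r_hi).sum(axis=1)
    dr = float(y_hat @ c + w1 * ((y_s - yh_s) @ c[labeled_idx]))

    # Estimated number of true targets, used in the normalization.
    n_d = float(y_hat.sum() + w1 * (y_s - yh_s).sum())
    return dd, dr, n_d

def landy_szalay(dd, dr, rr, n_d, n_r):
    """Standard Landy-Szalay combination of normalized pair counts."""
    dd_n = dd / (n_d * (n_d - 1) / 2)
    dr_n = dr / (n_d * n_r)
    rr_n = rr / (n_r * (n_r - 1) / 2)
    return (dd_n - 2 * dr_n + rr_n) / rr_n
```

The random-random count is untouched by the correction, for example `rr = weighted_auto_pairs(rand, np.ones(len(rand)), r_lo, r_hi)`, which mirrors the statement that the random-catalog normalization is preserved. When every object is labeled, both weights equal one and the corrected counts reduce to the true-label counts. A production pipeline would replace the dense distance matrix with a tree-based pair counter.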

Problem

Research questions and friction points this paper is trying to address.

two-point correlation function

contaminated data

imperfect target selection

spatially structured errors

clustering bias

Innovation

Methods, ideas, or system contributions that make the work stand out.

two-point correlation function

contamination-robust estimation

prediction-powered inference