🤖 AI Summary
This work addresses robust mean estimation in high-dimensional collaborative learning under untrusted batched data: $N$ users each contribute a batch of $n$ samples, an $\varepsilon$-fraction of the users are fully malicious, and the remaining users exhibit either mean shifts or sample-level adversarial corruptions, governed by a proximity parameter $\alpha$. The paper introduces the first framework for handling such dual corruption in the continuous high-dimensional setting, showing that the batch structure inherently suppresses the influence of malicious users by a factor of $1/\sqrt{n}$. Two algorithms based on the Sum-of-Squares hierarchy handle distributional heterogeneity and adversarial contamination jointly, attaining the minimax-optimal error rate of $O(\sqrt{\varepsilon/n} + \sqrt{d/(nN)} + \sqrt{\alpha})$.
📝 Abstract
We study high-dimensional mean estimation in a collaborative setting where data is contributed by $N$ users in batches of size $n$. In this environment, a learner seeks to recover the mean $\mu$ of a true distribution $P$ from a collection of sources that are both statistically heterogeneous and potentially malicious. We formalize this challenge through a double corruption landscape: an $\varepsilon$-fraction of users are entirely adversarial, while the remaining "good" users provide data from distributions that are related to $P$ but deviate from it by a proximity parameter $\alpha$. Unlike existing work on the untrusted batch model, which typically measures this deviation via total variation distance in discrete settings, we address the continuous, high-dimensional regime under two natural variants of deviation: (1) good batches are drawn from distributions with a mean shift of $\sqrt{\alpha}$, or (2) an $\alpha$-fraction of samples within each good batch are adversarially corrupted. The second model in particular presents significant new challenges: in high dimensions, unlike in discrete settings, even a small fraction of sample-level corruption can shift empirical means and covariances arbitrarily. We provide two Sum-of-Squares (SoS) based algorithms to navigate this tiered corruption. Our algorithms achieve the minimax-optimal error rate $O(\sqrt{\varepsilon/n} + \sqrt{d/(nN)} + \sqrt{\alpha})$, demonstrating that while the heterogeneity $\alpha$ represents an inherent statistical difficulty, the influence of adversarial users is suppressed by a factor of $1/\sqrt{n}$ due to the internal averaging afforded by the batch structure.
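To make the roles of the three terms in the stated rate concrete, the following is a minimal Python sketch that evaluates each term separately. It is not the paper's algorithm: the function name `rate_terms` and the dictionary keys are hypothetical, and the constants hidden by the $O(\cdot)$ notation are assumed to be 1 purely for illustration.

```python
import math


def rate_terms(eps: float, n: int, N: int, d: int, alpha: float) -> dict:
    """Evaluate the three terms of the stated error rate.

    Assumption: all constants hidden by the O(.) notation are set to 1,
    so the values show scaling only, not the actual error of any algorithm.
    """
    return {
        # Influence of the eps-fraction of fully adversarial users.
        "adversarial_users": math.sqrt(eps / n),
        # Sampling error from n * N samples in dimension d.
        "sampling": math.sqrt(d / (n * N)),
        # Heterogeneity among good users; does not depend on n.
        "heterogeneity": math.sqrt(alpha),
    }


# Same parameters, batch size 1 versus batch size 100.
small = rate_terms(eps=0.1, n=1, N=1000, d=50, alpha=0.01)
large = rate_terms(eps=0.1, n=100, N=1000, d=50, alpha=0.01)

for name in small:
    print(f"{name}: n=1 -> {small[name]:.4f}, n=100 -> {large[name]:.4f}")
```

Under these assumed parameters, raising the batch size from 1 to 100 shrinks the adversarial-user and sampling terms by a factor of 10, while the heterogeneity term $\sqrt{\alpha}$ is unchanged, which matches the abstract's point that $\alpha$ is an inherent difficulty and the adversarial influence scales as $1/\sqrt{n}$.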