🤖 AI Summary
This paper studies multi-task learning in a setting where tasks share similar, but not identical, linear representations and a subset of tasks are adversarial outliers, while the intrinsic representation dimension may be unknown. To address both adaptivity and robustness, we propose two methods: (i) a penalized empirical risk minimization (ERM) approach and (ii) a spectral method based on eigen-decomposition and hard thresholding. We establish the first adaptive robustness theory for similar linear representations, together with a thresholding procedure that estimates the unknown intrinsic dimension automatically. Both methods provably match or improve upon single-task learning; they achieve significant gains when task representations are sufficiently similar and outlier tasks are few. Notably, both methods are nearly minimax optimal in a large regime, and the spectral method is minimax optimal in the absence of outlier tasks. Our theoretical guarantees are corroborated by information-theoretic lower bounds and comprehensive numerical experiments.
📝 Abstract
Representation multi-task learning (MTL) has achieved tremendous success in practice. However, the theoretical understanding of these methods is still lacking. Most existing theoretical works focus on cases where all tasks share the same representation, and claim that MTL almost always improves performance. Nevertheless, as the number of tasks grows, assuming all tasks share the same representation is unrealistic. Furthermore, empirical findings often indicate that a shared representation does not necessarily improve single-task learning performance. In this paper, we aim to understand how to learn from tasks with \textit{similar but not exactly the same} linear representations, while dealing with outlier tasks. Assuming a known intrinsic dimension, we propose a penalized empirical risk minimization method and a spectral method that are \textit{adaptive} to the similarity structure and \textit{robust} to outlier tasks. Both algorithms outperform single-task learning when representations across tasks are sufficiently similar and the proportion of outlier tasks is small. Moreover, they always perform at least as well as single-task learning, even when the representations are dissimilar. We provide information-theoretic lower bounds to demonstrate that both methods are nearly \textit{minimax} optimal in a large regime, with the spectral method being optimal in the absence of outlier tasks. Additionally, we introduce a thresholding algorithm to adapt to an unknown intrinsic dimension. We conduct extensive numerical experiments to validate our theoretical findings.