On the Power of Source Screening for Learning Shared Feature Extractors

📅 2026-02-17
🤖 AI Summary
This work addresses the challenge of learning a shared low-dimensional feature representation from multi-source heterogeneous data, where redundant or noisy information can impair statistical efficiency. The authors introduce the concept of an “informative subset” of sources and propose selecting the most informative subset—rather than using all available data—to achieve minimax-optimal estimation of the shared subspace. They develop both a theoretically grounded source selection algorithm and a practical heuristic strategy, proving within a linear subspace modeling framework that the selected subset attains statistical optimality. Extensive experiments on synthetic and real-world datasets demonstrate the effectiveness and robustness of the proposed approach.

📝 Abstract
Learning with shared representations is widely recognized as an effective way to separate commonalities from heterogeneity across heterogeneous sources. Most existing work includes all related data sources by simultaneously training a common feature extractor and source-specific heads. It is well understood that data sources with low relevance or poor quality may hinder representation learning. In this paper, we dig further into the question of which data sources should be learned jointly by focusing on the traditionally deemed "good" collection of sources, in which individual sources have similar relevance and quality with respect to the true underlying common structure. For tractability, we focus on the linear setting where sources share a low-dimensional subspace. We find that source screening can play a central role in statistically optimal subspace estimation. We show that, for a broad class of problem instances, training on a carefully selected subset of sources suffices to achieve minimax optimality, even when a substantial portion of the data is discarded. We formalize the notion of an informative subpopulation, develop algorithms and practical heuristics for identifying such subsets, and validate their effectiveness through both theoretical analysis and empirical evaluations on synthetic and real-world datasets.
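
The linear setting in the abstract can be illustrated with a small simulation. The sketch below is not the paper's algorithm: it assumes each source is a linear regression whose parameter vector lies in a shared k-dimensional subspace, estimates that subspace with a simple spectral step over per-source least-squares fits, and takes the subset of sources as a hand-picked argument rather than selecting it. All function names, dimensions, and the noise level are hypothetical choices made for illustration.

```python
import numpy as np


def simulate_sources(d, k, T, n, noise_std, rng):
    """Draw T regression sources whose parameters share one k-dim subspace (assumed model)."""
    B, _ = np.linalg.qr(rng.standard_normal((d, k)))  # orthonormal basis, d x k
    W = rng.standard_normal((k, T))                   # source-specific heads
    data = []
    for t in range(T):
        X = rng.standard_normal((n, d))
        y = X @ (B @ W[:, t]) + noise_std * rng.standard_normal(n)
        data.append((X, y))
    return B, data


def per_source_ols(data):
    """Least-squares estimate of each source's parameter, stacked as a d x T matrix."""
    return np.column_stack(
        [np.linalg.lstsq(X, y, rcond=None)[0] for X, y in data]
    )


def estimate_subspace(thetas, k, subset=None):
    """Top-k left singular vectors of the chosen columns (all sources if subset is None)."""
    cols = thetas if subset is None else thetas[:, subset]
    U, _, _ = np.linalg.svd(cols, full_matrices=False)
    return U[:, :k]


def subspace_error(B_hat, B_true):
    """Spectral norm of the part of B_true lying outside span(B_hat); 0 means exact recovery."""
    d = B_true.shape[0]
    return np.linalg.norm((np.eye(d) - B_hat @ B_hat.T) @ B_true, ord=2)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    B_true, data = simulate_sources(d=20, k=3, T=40, n=50, noise_std=0.5, rng=rng)
    thetas = per_source_ols(data)

    # Compare using every source against an arbitrary, hand-picked subset.
    err_all = subspace_error(estimate_subspace(thetas, k=3), B_true)
    err_sub = subspace_error(estimate_subspace(thetas, k=3, subset=list(range(10))), B_true)
    print(f"all sources: {err_all:.3f}  first 10 sources: {err_sub:.3f}")
```

The `subset` argument is where a screening rule would plug in. Which sources form an informative subpopulation, and how to identify them, is the paper's contribution and is not reproduced here.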
Problem

Research questions and friction points this paper is trying to address.

source screening
shared representation learning
heterogeneous sources
subspace estimation
informative subpopulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

source screening
shared representation learning
minimax optimality
informative subpopulation
subspace estimation