On the Power of Source Screening for Learning Shared Feature Extractors

📅 2026-02-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of learning a shared low-dimensional feature representation from multi-source heterogeneous data, where redundant or noisy information can impair statistical efficiency. The authors introduce the concept of an “informative subset” of sources and propose selecting the most informative subset—rather than using all available data—to achieve minimax-optimal estimation of the shared subspace. They develop both a theoretically grounded source selection algorithm and a practical heuristic strategy, proving within a linear subspace modeling framework that the selected subset attains statistical optimality. Extensive experiments on synthetic and real-world datasets demonstrate the effectiveness and robustness of the proposed approach.

📝 Abstract

Learning a shared representation is widely recognized as an effective way to separate commonalities from heterogeneity across data sources. Most existing work includes all related data sources by simultaneously training a common feature extractor and source-specific heads. It is well understood that data sources with low relevance or poor quality may hinder representation learning. In this paper, we look more closely at the question of which data sources should be learned jointly, focusing on collections of sources traditionally deemed "good", in which individual sources have similar relevance and quality with respect to the true underlying common structure. For tractability, we focus on the linear setting where sources share a low-dimensional subspace. We find that source screening can play a central role in statistically optimal subspace estimation. We show that, for a broad class of problem instances, training on a carefully selected subset of sources suffices to achieve minimax optimality, even when a substantial portion of the data is discarded. We formalize the notion of an informative subpopulation, develop algorithms and practical heuristics for identifying such subsets, and validate their effectiveness through both theoretical analysis and empirical evaluations on synthetic and real-world datasets.
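
To make the linear setting concrete, the following is a minimal sketch, not the paper's algorithm. It assumes each source t follows y = X B θ_t + noise with a shared orthonormal basis B, fits per-source least squares, and estimates the subspace from the top singular vectors of the stacked coefficients. The function names, the noise model, and the estimator are all illustrative assumptions, and the subset used in the usage note is arbitrary rather than chosen by any screening rule from the paper.

```python
import numpy as np

def simulate_sources(d=20, r=3, T=30, n=200, noise=0.1, seed=0):
    """Simulate T linear sources sharing an r-dimensional subspace (assumed model)."""
    rng = np.random.default_rng(seed)
    # Shared orthonormal basis B (d x r), obtained from a QR factorization.
    B, _ = np.linalg.qr(rng.standard_normal((d, r)))
    betas_hat = np.empty((d, T))
    for t in range(T):
        theta = rng.standard_normal(r)          # source-specific head
        X = rng.standard_normal((n, d))         # source-specific design
        y = X @ B @ theta + noise * rng.standard_normal(n)
        # Per-source least-squares estimate of the coefficient vector B @ theta.
        betas_hat[:, t] = np.linalg.lstsq(X, y, rcond=None)[0]
    return B, betas_hat

def estimate_subspace(betas, r):
    """Top-r left singular vectors of the d x T coefficient matrix."""
    U, _, _ = np.linalg.svd(betas, full_matrices=False)
    return U[:, :r]

def subspace_distance(U, V):
    """Spectral norm of the difference between the two orthogonal projectors."""
    return np.linalg.norm(U @ U.T - V @ V.T, 2)
```

As a usage example, `B, betas_hat = simulate_sources()` followed by `estimate_subspace(betas_hat, 3)` estimates the subspace from all sources, while `estimate_subspace(betas_hat[:, :10], 3)` uses only the first ten. Comparing `subspace_distance(B, ...)` for the two estimates shows how the error changes when sources are dropped; which subset to keep is exactly the question the paper studies.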

Problem

Research questions and friction points this paper is trying to address.

source screening

shared representation learning

heterogeneous sources

subspace estimation

informative subpopulation

Innovation

Methods, ideas, or system contributions that make the work stand out.

source screening

shared representation learning

minimax optimality