🤖 AI Summary
To balance sampling efficiency against information preservation when learning offline from data streams, this paper proposes a prediction-oriented, information-theoretic subsampling framework. Unlike conventional approaches that maximize the entropy of the input data, the method guides sampling decisions by minimizing posterior uncertainty on downstream prediction tasks, and it incorporates a lightweight model-aware mechanism to keep sampling stable and computationally tractable. Extensive experiments on time-series forecasting and anomaly detection show that the proposed method significantly outperforms existing information-theoretic baselines: it achieves an average 12.7% reduction in prediction error at equivalent sampling rates while keeping computational overhead scalable. The core contribution is the first explicit formulation of predictive uncertainty as a principled subsampling criterion, unifying theoretical interpretability with practical performance.
📝 Abstract
Data is often generated in streams, with new observations arriving over time. A key challenge for learning models from data streams is capturing relevant information while keeping computational costs manageable. We explore intelligent data subsampling for offline learning, and argue for an information-theoretic method centred on reducing uncertainty in downstream predictions of interest. Empirically, we demonstrate that this prediction-oriented approach performs better than a previously proposed information-theoretic technique on two widely studied problems. At the same time, we highlight that reliably achieving strong performance in practice requires careful model design.
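The prediction-oriented criterion described above can be sketched concretely: score each streamed candidate by how much retaining it would reduce posterior predictive uncertainty at a set of downstream prediction targets, and keep the highest-scoring points within a fixed budget. The sketch below is a hypothetical simplification (it uses Bayesian linear regression with a greedy replacement buffer; the function names, the noise/prior settings, and the replacement scheme are illustrative assumptions, not the paper's actual method):

```python
import numpy as np

def predictive_variances(X_obs, X_targets, noise=0.1, prior=1.0):
    """Posterior predictive variance at each target input under
    Bayesian linear regression fitted to the observed inputs X_obs.

    Posterior covariance: (prior * I + X^T X / noise^2)^{-1};
    predictive variance at x*: x*^T Sigma x*.
    """
    d = X_targets.shape[1]
    A = prior * np.eye(d)
    if len(X_obs):
        A = A + X_obs.T @ X_obs / noise**2
    cov = np.linalg.inv(A)
    # x Sigma x^T for every target row at once
    return np.einsum("td,de,te->t", X_targets, cov, X_targets)

def prediction_oriented_subsample(stream, X_targets, budget, noise=0.1):
    """Greedily keep the streamed points whose retention most reduces
    total predictive variance at the downstream prediction targets."""
    kept = []
    for x in stream:
        if len(kept) < budget:          # fill the budget first
            kept.append(np.asarray(x))
            continue
        base = np.array(kept)
        current = predictive_variances(base, X_targets, noise).sum()
        best_gain, best_i = 0.0, None
        for i in range(budget):
            trial = base.copy()
            trial[i] = x                # try swapping x in for point i
            gain = current - predictive_variances(trial, X_targets, noise).sum()
            if gain > best_gain:
                best_gain, best_i = gain, i
        if best_i is not None:          # swap only if uncertainty drops
            kept[best_i] = np.asarray(x)
    return np.array(kept)
```

With targets lying along one input direction, the buffer ends up keeping the streamed points informative about that direction and discarding the rest, which is the intended contrast with criteria that maximize input entropy regardless of the prediction task.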