🤖 AI Summary
This paper addresses the open-set label shift problem, where the test distribution may contain a novel class absent from training, so that neither the class proportions nor the novel-class distribution is identifiable without extra assumptions. To tackle this challenge, we propose a semiparametric density ratio model framework that ensures identifiability while allowing the novel class to overlap with known classes, avoiding the restrictive separability conditions or prior knowledge that existing approaches often require. Within this framework, the method combines maximum empirical likelihood estimation of class proportions, confidence intervals with established asymptotic validity, a stable Expectation-Maximization (EM) algorithm for computation, and an approximately optimal classifier based on posterior probabilities with theoretical guarantees. Simulations and a real data application show improved class proportion estimation accuracy and classification performance compared with existing approaches.
📝 Abstract
We study the open-set label shift problem, where the test data may include a novel class absent from training. This setting is challenging because both the class proportions and the distribution of the novel class are not identifiable without extra assumptions. Existing approaches often rely on restrictive separability conditions, prior knowledge, or computationally infeasible procedures, and some may lack theoretical guarantees. We propose a semiparametric density ratio model framework that ensures identifiability while allowing overlap between novel and known classes. Within this framework, we develop maximum empirical likelihood estimators and confidence intervals for class proportions, establish their asymptotic validity, and design a stable Expectation-Maximization algorithm for computation. We further construct an approximately optimal classifier based on posterior probabilities with theoretical guarantees. Simulations and a real data application confirm that our methods improve both estimation accuracy and classification performance compared with existing approaches.
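To give a concrete feel for EM-based estimation of class proportions and posterior-based classification, the following is a minimal sketch of the classical closed-set version of the idea, in which source-classifier posteriors are reweighted by the ratio of target to source priors. It is not the semiparametric density ratio method proposed in the paper: it assumes no novel class is present and that the source posteriors are exact. The function name `em_label_shift`, the two-Gaussian toy data, and all parameter values are assumptions chosen purely for illustration.

```python
import numpy as np

def em_label_shift(source_posteriors, source_priors, n_iter=500, tol=1e-10):
    """Closed-set EM for target class proportions (illustrative only).

    source_posteriors: (n, K) array of p_source(y=k | x_i) on unlabeled target points.
    source_priors:     (K,) class proportions in the training data.
    Returns (estimated target proportions, adjusted target posteriors).
    """
    post = np.asarray(source_posteriors, dtype=float)
    pi_s = np.asarray(source_priors, dtype=float)
    pi_t = pi_s.copy()  # start from the training proportions
    for _ in range(n_iter):
        # E-step: reweight posteriors by the prior ratio, renormalize each row.
        w = post * (pi_t / pi_s)
        resp = w / w.sum(axis=1, keepdims=True)
        # M-step: updated proportions are the average responsibilities.
        new_pi = resp.mean(axis=0)
        converged = np.max(np.abs(new_pi - pi_t)) < tol
        pi_t = new_pi
        if converged:
            break
    w = post * (pi_t / pi_s)
    return pi_t, w / w.sum(axis=1, keepdims=True)

# Toy data (assumed): two unit-variance Gaussian classes with means -1 and +1.
rng = np.random.default_rng(0)
n = 5000
true_pi = np.array([0.2, 0.8])          # target proportions to recover
means = np.array([-1.0, 1.0])
labels = rng.choice(2, size=n, p=true_pi)
x = rng.normal(loc=means[labels], scale=1.0)

# Exact source posteriors under balanced training priors; the Gaussian
# normalizing constant is shared by both classes and cancels.
source_priors = np.array([0.5, 0.5])
dens = np.exp(-0.5 * (x[:, None] - means[None, :]) ** 2)
source_post = dens * source_priors
source_post /= source_post.sum(axis=1, keepdims=True)

pi_hat, adjusted = em_label_shift(source_post, source_priors)
predicted = adjusted.argmax(axis=1)     # classify by adjusted posterior
print("estimated target proportions:", pi_hat)
```

In the open-set setting studied here, an additional mixture component for the novel class would be needed, and the abstract indicates that this component is handled through a density ratio model and empirical likelihood, which the sketch above does not attempt.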