🤖 AI Summary
This work addresses the operational reliability of Android malware classifiers under distribution shift, revealing that state-of-the-art models, despite high accuracy, suffer from severe miscalibration: annotation budget is misallocated in active learning, and high-risk samples escape detection. To this end, we introduce AURORA, the first evaluation framework explicitly designed for confidence-quality validation. It defines multidimensional operational resilience metrics: confidence calibration, drift modeling, temporal robustness, active learning budget efficiency, and selective classification failure detection. Comprehensive evaluation across diverse real-world datasets demonstrates pervasive confidence-error misalignment and insufficient long-term stability in mainstream models. Our findings establish that classifier design must be fundamentally restructured around trustworthiness, not just accuracy, to ensure robust, reliable mobile security AI. This work provides both theoretical foundations and a practical paradigm for building trustworthy, operationally resilient Android malware detection systems.
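The confidence-calibration dimension listed above is conventionally measured with expected calibration error (ECE): predictions are binned by confidence, and per-bin mean confidence is compared with empirical accuracy. The sketch below is a minimal, illustrative implementation of standard ECE, not AURORA's actual code; all names are ours.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard ECE: bin predictions by confidence and take the
    coverage-weighted gap between mean confidence and accuracy.
    A well-calibrated classifier yields ECE near zero."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # weight gap by bin population
    return ece
```

For instance, a model that reports 95% confidence while being right only half the time in that bin contributes a large gap, which is exactly the confidence-error misalignment the summary describes.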
📝 Abstract
The performance figures of modern drift-adaptive malware classifiers appear promising, but does this translate to genuine operational reliability? The standard evaluation paradigm primarily focuses on baseline performance metrics, neglecting confidence-error alignment and operational stability. While TESSERACT established the importance of temporal evaluation, we take a complementary direction by investigating whether malware classifiers maintain reliable confidence estimates under distribution shift and by exploring the tensions between scientific advancement and practical impact when they do not. We propose AURORA, a framework to evaluate malware classifiers based on their confidence quality and operational resilience. AURORA verifies the confidence profile of a given model to assess the reliability of its estimates. Unreliable confidence estimates erode operational trust, waste valuable annotation budget on non-informative samples in active learning, and leave error-prone instances undetected in selective classification. AURORA is further complemented by a set of metrics designed to go beyond point-in-time performance, striving toward a more holistic assessment of operational stability throughout temporal evaluation periods. The fragility we observe in state-of-the-art frameworks across datasets of varying drift severity suggests the need for a return to the whiteboard.
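The selective-classification failure mode mentioned in the abstract, error-prone instances slipping past a confidence-based abstention rule, can be made concrete with a risk-coverage computation: accept predictions above a confidence threshold and measure the error rate on the accepted set. This is a minimal sketch under our own naming, not AURORA's implementation.

```python
import numpy as np

def selective_risk_coverage(confidences, correct, threshold):
    """Accept predictions whose confidence meets the threshold;
    return (coverage, selective risk) over the accepted set.
    With miscalibrated confidences, risk stays high even at
    low coverage, because errors hide among confident predictions."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    accepted = confidences >= threshold
    coverage = float(accepted.mean())
    if not accepted.any():
        return 0.0, 0.0  # nothing accepted: vacuous zero risk
    risk = 1.0 - float(correct[accepted].mean())
    return coverage, risk
```

Sweeping the threshold traces a risk-coverage curve; for a classifier with reliable confidence estimates, selective risk should fall steeply as coverage decreases, which is exactly the property AURORA's evaluation probes.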