AI Summary
Specialized large language models (LLMs) often produce unreliable outputs on out-of-distribution (OOD) inputs, posing safety risks in critical applications. To address this, we propose a post-hoc OOD detection method based on multi-layer dropout tolerance, the first to formalize this tolerance as a non-conformity score within the Inductive Conformal Anomaly Detection (ICAD) framework. Leveraging inherent polysemanticity and redundancy in LLMs, our approach quantifies response stability via ensemble-based stochastic dropout across multiple transformer layers. Theoretically, it guarantees controllable false positive rates under standard conformal prediction assumptions. Extensive experiments on medical-domain LLMs demonstrate significant improvements: AUROC increases by 2 to 37 percentage points over state-of-the-art baselines, markedly enhancing OOD detection accuracy while strictly bounding the false positive rate.
Abstract
We propose a novel inference-time out-of-domain (OOD) detection algorithm for specialized large language models (LLMs). Despite achieving state-of-the-art performance on in-domain tasks through fine-tuning, specialized LLMs remain vulnerable to incorrect or unreliable outputs when presented with OOD inputs, posing risks in critical applications. Our method leverages the Inductive Conformal Anomaly Detection (ICAD) framework, using a new non-conformity measure based on the model's dropout tolerance. Motivated by recent findings on polysemanticity and redundancy in LLMs, we hypothesize that in-domain inputs exhibit higher dropout tolerance than OOD inputs. We aggregate dropout tolerance across multiple layers via a valid ensemble approach, improving detection while maintaining theoretical false alarm bounds from ICAD. Experiments with medical-specialized LLMs show that our approach detects OOD inputs better than baseline methods, with AUROC improvements of 2% to 37% when treating OOD datapoints as positives and in-domain test datapoints as negatives.