AI Summary
Rare disease prediction faces challenges including severe data scarcity, ambiguous disease descriptions, and misalignment between clinical text and standardized terminology knowledge. Method: We propose the Unified Discrete Coding (UDC) framework, which constructs a shared discrete semantic space via condition-aware vector quantization to jointly model clinical text and electronic health records. UDC employs cross-domain hard negative contrastive learning and co-teacher distillation to achieve bidirectional semantic alignment, abandoning continuous latent representations in favor of a discrete encoder-decoder architecture with explicit semantic alignment. Contribution/Results: Evaluated on three real-world medical datasets, UDC significantly outperforms state-of-the-art methods, improving average accuracy and F1-score for rare disease prediction by 6.2% and 7.8%, respectively. This work is the first to empirically validate the efficacy and generalizability of discrete semantic modeling for few-shot medical tasks.
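The condition-aware vector quantization mentioned above can be sketched roughly as follows. This is a minimal NumPy illustration, not the paper's actual implementation: the codebook size, the way the condition is mixed into the lookup query (a learned linear projection here), and all names are assumptions.

```python
import numpy as np

def condition_aware_quantize(z, codebook, cond, cond_proj):
    """Quantize encoder outputs to their nearest codebook entries,
    biasing the lookup with a condition vector (hypothetical formulation).

    z:         (n, d) continuous encoder outputs
    codebook:  (k, d) discrete code embeddings
    cond:      (c,)   condition features (e.g., patient context)
    cond_proj: (c, d) projection mixing the condition into the query
    """
    # Shift the query by the projected condition before the nearest-neighbour lookup
    q = z + cond @ cond_proj                                     # (n, d)
    # Squared Euclidean distance from each query to every codebook entry
    d2 = ((q[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)   # (n, k)
    idx = d2.argmin(axis=1)                                      # discrete code indices
    return codebook[idx], idx

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 8))           # 4 visit encodings, dim 8
codebook = rng.normal(size=(16, 8))   # 16 discrete codes
cond = rng.normal(size=3)
cond_proj = rng.normal(size=(3, 8))
z_q, idx = condition_aware_quantize(z, codebook, cond, cond_proj)
```

In a full VQ model the codebook would be trained (e.g., with a straight-through estimator and commitment loss); here only the conditioned lookup is shown.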
Abstract
Accurate healthcare prediction is essential for improving patient outcomes. Existing work primarily leverages advanced frameworks such as attention or graph networks to capture the intricate collaborative (CO) signals in electronic health records. However, prediction for rare diseases remains challenging due to limited co-occurrence and inadequately tailored approaches. To address this issue, this paper proposes UDC, a novel method that unveils discrete clues to bridge consistent textual knowledge and CO signals within a unified semantic space, thereby enriching the representation semantics of rare diseases. Specifically, we focus on two key sub-problems: (1) acquiring distinguishable discrete encodings for precise disease representation and (2) achieving semantic alignment between textual knowledge and CO signals at the code level. For the first sub-problem, we refine the standard vector quantization process to include condition awareness. Additionally, we develop an advanced contrastive approach in the decoding stage, leveraging synthetic and mixed-domain targets as hard negatives to enhance the discriminability of the reconstructed representations for downstream tasks. For the second sub-problem, we introduce a novel codebook update strategy based on co-teacher distillation. This approach enables bidirectional supervision between textual knowledge and CO signals, aligning semantically equivalent information in a shared discrete latent space. Extensive experiments on three real-world datasets demonstrate the superiority of our method.
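The hard-negative contrastive idea in the decoding stage can be illustrated with an InfoNCE-style loss in which extra "synthetic" negatives are built by mixing the positive with in-batch negatives. This is a generic sketch of the technique, assuming a simple convex-mix construction; UDC's actual negative construction, temperature, and similarity function may differ.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def hard_negative_infonce(anchor, positive, negatives, mix_lambda=0.5, tau=0.1):
    """InfoNCE loss with synthetic hard negatives (hypothetical formulation):
    each synthetic negative is a convex mix of the positive and one negative,
    so it sits deliberately close to the decision boundary."""
    def norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    a, p, n = norm(anchor), norm(positive), norm(negatives)
    # Synthetic hard negatives: mixes of the positive with each in-batch negative
    synth = norm(mix_lambda * p[None, :] + (1 - mix_lambda) * n)
    cands = np.vstack([p[None, :], n, synth])   # row 0 holds the true positive
    logits = cands @ a / tau                    # cosine similarities / temperature
    return -np.log(softmax(logits)[0])          # negative log-prob of the positive

rng = np.random.default_rng(1)
anchor = rng.normal(size=16)
positive = anchor + 0.05 * rng.normal(size=16)  # a slightly perturbed positive
negatives = rng.normal(size=(8, 16))
loss = hard_negative_infonce(anchor, positive, negatives)
```

Because the mixed targets retain part of the positive's direction, they score higher than random negatives and force the model to learn finer-grained distinctions.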