BALI: Enhancing Biomedical Language Representations through Knowledge Graph and Language Model Alignment

📅 2025-09-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current biomedical pre-trained language models inadequately capture the complex conceptual structures and factual knowledge embedded in domain-specific knowledge graphs. To address this, we propose BALI, a novel framework that, for the first time, achieves end-to-end representation alignment between language models and biomedical knowledge graphs. BALI links concept mentions to UMLS and uses the local KG subgraphs around them as cross-modal positive supervision, jointly optimizing biomedical language encoders (e.g., PubMedBERT, BioLinkBERT) with a graph encoder over UMLS. Crucially, it enables efficient joint pre-training using only a small number of PubMed abstracts. Experimental results demonstrate substantial improvements across downstream tasks, including named entity recognition and relation extraction, while also enhancing the semantic quality and generalizability of entity embeddings. These findings validate knowledge-guided, end-to-end alignment as an effective paradigm for biomedical NLP.

📝 Abstract
In recent years, there has been substantial progress in using pretrained Language Models (LMs) on a range of tasks aimed at improving the understanding of biomedical texts. Nonetheless, existing biomedical LMs show limited comprehension of complex, domain-specific concept structures and of the factual information encoded in biomedical Knowledge Graphs (KGs). In this work, we propose BALI (Biomedical Knowledge Graph and Language Model Alignment), a novel joint LM and KG pre-training method that augments an LM with external knowledge by simultaneously learning a dedicated KG encoder and aligning the representations of the LM and the graph. For a given textual sequence, we link biomedical concept mentions to the Unified Medical Language System (UMLS) KG and use the local KG subgraphs around these mentions as cross-modal positive samples. Our empirical findings indicate that applying our method to several leading biomedical LMs, such as PubMedBERT and BioLinkBERT, improves both their performance on a range of language understanding tasks and the quality of their entity representations, even with minimal pre-training on a small alignment dataset sourced from PubMed scientific abstracts.
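The alignment step described in the abstract pairs each mention embedding produced by the LM with the embedding of its local UMLS subgraph produced by the KG encoder, treating the linked pair as a positive and other pairs in the batch as negatives. A minimal sketch of one plausible form of such an objective, a symmetric InfoNCE-style contrastive loss, is shown below; the paper's exact loss, temperature, and encoders are not reproduced here, so the function name, the temperature value, and the use of plain NumPy in place of a deep-learning framework are all assumptions for illustration.

```python
import numpy as np

def alignment_loss(text_emb, graph_emb, temperature=0.07):
    """Symmetric InfoNCE-style contrastive loss (illustrative, not the
    paper's exact objective). Row i of text_emb is an LM mention embedding;
    row i of graph_emb is the KG encoder's embedding of that mention's
    local UMLS subgraph. Matching rows are positives, all others negatives."""
    # L2-normalize so dot products are cosine similarities
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    g = graph_emb / np.linalg.norm(graph_emb, axis=1, keepdims=True)
    logits = t @ g.T / temperature  # pairwise similarity matrix

    def cross_entropy_diag(lg):
        # numerically stable log-softmax per row; the target for row i is
        # column i (its own positive subgraph)
        lg = lg - lg.max(axis=1, keepdims=True)
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        idx = np.arange(lg.shape[0])
        return -log_probs[idx, idx].mean()

    # average the text->graph and graph->text directions
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))
```

As a sanity check, the loss should be small when the two modalities already agree (each mention embedding equals its subgraph embedding) and large when the pairing is random, which is what drives the encoders toward a shared space during joint pre-training.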
Problem

Research questions and friction points this paper is trying to address.

Improving biomedical language model comprehension of domain-specific concepts
Aligning language models with biomedical knowledge graphs for enhanced representation
Addressing the limited grasp of factual knowledge in existing biomedical language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Joint LM and KG pre-training method
Aligns LM and graph representations
Uses UMLS KG subgraphs as cross-modal samples