🤖 AI Summary
Large language model (LLM)-generated document representations suffer from high dimensionality, substantial computational overhead, and, despite strong generalization, weak domain specificity. To address this, we propose a Bayesian optimization–guided early-fusion framework that jointly models LLM embeddings alongside structured semantic information from local and external knowledge graphs (e.g., Wikidata). Our method learns low-dimensional, interpretable weight assignments that preserve semantic richness while substantially improving domain adaptability. Combined with an AutoML classifier for downstream training, the framework achieves state-of-the-art or competitive performance against specialized LLM embedding baselines across six cross-domain datasets, reduces computational complexity by 37%–62%, and improves decision interpretability through transparent, knowledge-informed fusion.
📝 Abstract
Building on the success of Large Language Models (LLMs), LLM-based representations have come to dominate the document representation landscape, achieving strong performance on document embedding benchmarks. However, the high-dimensional, computationally expensive embeddings produced by LLMs tend to be either too generic or too inefficient for domain-specific applications. To address these limitations, we introduce FuDoBa, a Bayesian optimisation-based method that integrates LLM-based embeddings with domain-specific structured knowledge, sourced both locally and from external repositories such as Wikidata. This fusion produces low-dimensional, task-relevant representations while reducing training complexity and yielding interpretable early-fusion weights for improved classification performance. We demonstrate the effectiveness of our approach on six datasets spanning two domains, showing that, when paired with robust AutoML-based classifiers, the resulting representations perform on par with, or surpass, proprietary LLM-based embedding baselines.
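To make the early-fusion idea concrete, the sketch below shows weighted concatenation of two views (an LLM embedding and a knowledge-graph feature vector) with the fusion weight selected by search. All data, dimensions, and the nearest-centroid scoring objective are illustrative assumptions, and plain random search stands in for the Bayesian optimisation the paper actually uses:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in data (assumed, not from the paper):
# a high-dimensional "LLM embedding" view and a low-dimensional "KG" view.
llm_emb = rng.normal(size=(100, 32))
kg_emb = rng.normal(size=(100, 8))
labels = (kg_emb[:, 0] > 0).astype(int)  # labels tied to the KG view

def fuse(w, a, b):
    """Early fusion: scale each view by its weight, then concatenate."""
    return np.hstack([w * a, (1.0 - w) * b])

def score(w):
    """Proxy downstream objective: nearest-centroid accuracy in fused space."""
    x = fuse(w, llm_emb, kg_emb)
    c0, c1 = x[labels == 0].mean(axis=0), x[labels == 1].mean(axis=0)
    pred = np.linalg.norm(x - c1, axis=1) < np.linalg.norm(x - c0, axis=1)
    return (pred == labels).mean()

# Random search over the fusion weight, as a simple stand-in for
# Bayesian optimisation of the per-view fusion weights.
candidates = rng.uniform(0.0, 1.0, size=20)
best_w = max(candidates, key=score)
print(f"best fusion weight: {best_w:.2f}, proxy accuracy: {score(best_w):.2f}")
```

Because the labels here depend only on the KG view, the search drives the weight toward that view, illustrating how learned fusion weights can double as an interpretability signal about which knowledge source matters for the task.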