Mind the Data Gap: Bridging LLMs to Enterprise Data Integration

📅 2024-12-29
🤖 AI Summary
Problem: Large language models (LLMs) perform significantly worse on private enterprise “dark data” than on public benchmarks, primarily because of domain semantic gaps, scarce labeled data, and missing domain ontologies. Method: We introduce the GOBY Benchmark, the first comprehensive benchmark for enterprise data integration, and propose three techniques: a hierarchical annotation framework, a runtime class-learning mechanism, and an ontology synthesis method. Together these enable end-to-end domain adaptation via LLM fine-tuning, dynamic semantic modeling, and ontology-guided data alignment. Contribution/Results: Experiments show that the approach raises data integration accuracy on GOBY to parity with leading public benchmarks, closing the performance gap of LLMs on core enterprise data tasks. GOBY provides a reproducible evaluation framework and a principled approach for deploying LLMs on high-value private enterprise data.

📝 Abstract
Leading large language models (LLMs) are trained on public data. However, most of the world's data is dark data that is not publicly accessible, mainly in the form of private organizational or enterprise data. We show that the performance of methods based on LLMs seriously degrades when tested on real-world enterprise datasets. Current benchmarks, based on public data, overestimate the performance of LLMs. We release a new benchmark dataset, the GOBY Benchmark, to advance discovery in enterprise data integration. Based on our experience with this enterprise benchmark, we propose techniques to uplift the performance of LLMs on enterprise data, including (1) hierarchical annotation, (2) runtime class-learning, and (3) ontology synthesis. We show that, once these techniques are deployed, the performance on enterprise data becomes on par with that of public data. The GOBY Benchmark can be obtained at https://goby-benchmark.github.io/.
Problem

Research questions and friction points this paper is trying to address.

Language Models
Private Data Adaptation
Real-world Applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

GOBY Benchmark
Private Data Performance
Enhanced Classification