InvertiTune: High-Quality Data Synthesis for Cost-Effective Single-Shot Text-to-Knowledge Graph Generation

📅 2025-12-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational cost of iterative LLM prompting in Text2KG tasks, and its tendency to omit complex relations, this paper proposes InvertiTune. Its core innovation is inverse training-data construction: authentic subgraphs are first extracted from knowledge bases and filtered for noise, then used to guide controllable LLM generation of high-quality, descriptive long-form text, yielding large-scale, high-fidelity (text, graph) pairs aligned with real-world data distributions. Fine-tuned on this dataset, lightweight models achieve end-to-end, single-shot knowledge graph generation. Experiments show that InvertiTune significantly outperforms both zero-shot large language models and state-of-the-art methods on CE12k and CrossEval-1200, achieving superior generation quality and strong cross-dataset generalization.

📝 Abstract
Large Language Models (LLMs) have revolutionized the ability to understand and generate text, enabling significant progress in automatic knowledge graph construction from text (Text2KG). Many Text2KG methods, however, rely on iterative LLM prompting, making them computationally expensive and prone to overlooking complex relations distributed throughout the text. To address these limitations, we propose InvertiTune, a framework that combines a controlled data generation pipeline with supervised fine-tuning (SFT). Within this framework, the data-generation pipeline systematically extracts subgraphs from large knowledge bases, applies noise filtering, and leverages LLMs to generate corresponding natural text descriptions, a task more aligned with LLM capabilities than direct KG generation from text. This pipeline enables generating datasets composed of longer texts paired with larger KGs that better reflect real-world scenarios compared to existing benchmarks, thus supporting effective SFT of lightweight models for single-shot KG construction. Experimental results on CE12k, a dataset generated using the introduced pipeline, show that InvertiTune outperforms larger non-fine-tuned LLMs as well as state-of-the-art Text2KG approaches, while also demonstrating stronger cross-dataset generalization on CrossEval-1200, a test set created from three established benchmark datasets and CE12k. These findings highlight the importance of realistic, high-quality training data for advancing efficient and high-performing Text2KG systems.
Problem

Research questions and friction points this paper is trying to address.

Iterative LLM prompting makes Text2KG construction computationally expensive
Multi-step prompting frequently overlooks complex relations distributed throughout the text
Existing benchmarks pair short texts with small graphs, leaving a shortage of realistic training data for single-shot generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Inverse data construction: subgraphs sampled from knowledge bases are noise-filtered and verbalized by LLMs into (text, graph) training pairs
Supervised fine-tuning of lightweight models for single-shot, end-to-end KG generation
Outperforms larger non-fine-tuned LLMs and state-of-the-art Text2KG methods, with stronger cross-dataset generalization
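The inverse pipeline above can be illustrated with a minimal sketch. This is not the paper's implementation: the toy knowledge base, the neighborhood-expansion sampler, the banned-relation filter, and the template verbalizer (a stand-in for the controllable LLM generation step) are all illustrative assumptions; only the three-step shape — sample subgraph, filter noise, verbalize into a (text, graph) pair — mirrors the description.

```python
# Hypothetical toy knowledge base of (head, relation, tail) triples.
KB = [
    ("Marie Curie", "born_in", "Warsaw"),
    ("Marie Curie", "field", "Physics"),
    ("Warsaw", "capital_of", "Poland"),
    ("Pierre Curie", "spouse", "Marie Curie"),
    ("Physics", "see_also", "Chemistry"),
]

def sample_subgraph(kb, seed_entity, max_triples=3):
    """Step 1: extract a connected subgraph by expanding from a seed entity."""
    subgraph, frontier = [], {seed_entity}
    for h, r, t in kb:
        if len(subgraph) >= max_triples:
            break
        if h in frontier or t in frontier:
            subgraph.append((h, r, t))
            frontier.update({h, t})
    return subgraph

def filter_noise(subgraph, banned_relations=frozenset({"see_also"})):
    """Step 2 (simplified): drop uninformative relations."""
    return [tr for tr in subgraph if tr[1] not in banned_relations]

def verbalize(subgraph):
    """Step 3: stand-in for the LLM that writes descriptive text for the graph."""
    return " ".join(f"{h} {r.replace('_', ' ')} {t}." for h, r, t in subgraph)

# One synthetic (text, graph) pair for supervised fine-tuning.
graph = filter_noise(sample_subgraph(KB, "Marie Curie"))
pair = {"text": verbalize(graph), "graph": graph}
```

At scale, many such pairs would form an SFT dataset in which the graph is the target and the generated text is the input, so the fine-tuned lightweight model learns the reverse (text-to-graph) direction in a single pass.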