Auto-Cypher: Improving LLMs on Cypher generation via LLM-supervised generation-verification framework

📅 2024-12-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the dual challenges of low generation quality and scarce annotated data in Text2Cypher tasks, this paper proposes an end-to-end LLM-supervised generate-and-verify pipeline. Methodologically, it introduces the novel "LLM-As-Database-Filler" strategy, in which an LLM populates a synthetic graph database so that generated Cypher queries can be executed, providing semantic execution feedback for closed-loop correctness verification. The authors construct SynthCypher, the first large-scale open-source synthetic dataset for Text2Cypher (29.8k samples), spanning diverse domains and query complexities. The approach combines graph-database simulation, instruction fine-tuning, and multi-model adaptation (LLaMA-3.1-8B, Mistral-7B, Qwen-7B). Models fine-tuned on SynthCypher achieve gains of up to 40% on the Text2Cypher test split and 30% on a SPIDER benchmark adapted for graph databases, substantially improving open-weight LLMs' ability to generate precise, executable Cypher queries for production graph databases such as Neo4j.
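The closed-loop verification step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the toy executor and all function names are hypothetical stand-ins for running queries against a real Neo4j instance that an LLM has populated.

```python
# Sketch of execution-based Cypher verification over an LLM-filled toy database.
# All names are illustrative; the paper's pipeline executes queries against an
# actual graph database populated by an LLM (LLM-As-Database-Filler).

def verify_by_execution(candidate, reference, execute):
    """Accept a candidate Cypher query only if it runs without error and
    returns the same rows as the reference query on the filled database."""
    try:
        got = execute(candidate)
    except Exception:
        return False  # non-executable query: reject immediately
    return sorted(got) == sorted(execute(reference))

# Stand-in for executing Cypher against the LLM-populated graph: a fixed
# query -> rows mapping; unknown queries "fail" like malformed Cypher would.
TOY_RESULTS = {
    "MATCH (p:Person)-[:ACTED_IN]->(m:Movie) RETURN m.title": [("Heat",), ("Inception",)],
    "MATCH (m:Movie)<-[:ACTED_IN]-(p:Person) RETURN m.title": [("Inception",), ("Heat",)],
}

def toy_execute(query):
    if query not in TOY_RESULTS:
        raise ValueError("query failed on toy database")
    return TOY_RESULTS[query]

ref = "MATCH (p:Person)-[:ACTED_IN]->(m:Movie) RETURN m.title"
good = "MATCH (m:Movie)<-[:ACTED_IN]-(p:Person) RETURN m.title"
bad = "MATCH (m:Movie RETURN m.title"  # malformed Cypher

print(verify_by_execution(good, ref, toy_execute))  # True: same rows, order-insensitive
print(verify_by_execution(bad, ref, toy_execute))   # False: fails to execute
```

The order-insensitive row comparison reflects that semantically equivalent Cypher queries may return rows in different orders; execution feedback of this kind is what lets the pipeline filter out non-executable or semantically wrong generations before they enter the training set.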

📝 Abstract
Graph databases like Neo4j are gaining popularity over traditional relational databases for modeling and querying complex, interconnected data. While translating natural language into SQL queries is well researched, generating Cypher queries for Neo4j remains relatively underexplored. In this work, we present an automated, LLM-supervised pipeline to generate high-quality synthetic data for Text2Cypher. Our Cypher data generation pipeline introduces LLM-As-Database-Filler, a novel strategy for ensuring Cypher query correctness, resulting in high-quality generations. Using our pipeline, we generate high-quality Text2Cypher data, SynthCypher, containing 29.8k instances across various domains and queries of varying complexity. Training open-source LLMs like LLaMA-3.1-8B, Mistral-7B, and Qwen-7B on SynthCypher yields performance gains of up to 40% on the Text2Cypher test split and 30% on the SPIDER benchmark, adapted for graph databases.
Problem

Research questions and friction points this paper is trying to address.

Neo4j
Cypher generation
Text2Cypher performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language Models
Text2Cypher Dataset
LLM-As-Database-Filler