GuideX: Guided Synthetic Data Generation for Zero-Shot Information Extraction

📅 2025-05-31

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Zero-shot information extraction degrades sharply in unseen domains because annotation guidelines differ from those seen in training. To address this, the paper proposes GUIDEX, which it presents as the first end-to-end guided synthetic data generation framework. It requires no human annotation: it automatically discovers domain-specific patterns, infers structured annotation guidelines, and generates high-quality, controllable synthetic data. The method fine-tunes Llama 3.1 on this data and integrates prompt-driven pattern discovery, guideline inference, structured instruction distillation, and consistency-based filtering to improve large language models’ understanding of, and generalization across, heterogeneous schemas. Evaluated on seven zero-shot named entity recognition benchmarks, it establishes new state-of-the-art results, improving on the prior best methods by up to +7.0 F1 in the pure zero-shot setting (no human-labeled data) and by +1.9 F1 when combined with human-labeled data.
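To make the vocabulary above concrete, here is a minimal sketch of how an inferred guideline, a synthetic annotation, and a simple consistency filter could be represented. It is an illustration under stated assumptions, not the paper's implementation: the names `LabelGuideline`, `Annotation`, and `filter_consistent` are hypothetical, and the filter rule (keep a span only if its label is in the schema and its text occurs verbatim in the document) is a stand-in for whatever consistency check GUIDEX actually applies.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LabelGuideline:
    """Hypothetical container: one entity type and its inferred guideline."""
    name: str
    definition: str
    examples: List[str] = field(default_factory=list)


@dataclass
class Annotation:
    """Hypothetical container: a labeled span produced by the generator."""
    text: str
    label: str


def filter_consistent(
    document: str,
    schema: List[LabelGuideline],
    annotations: List[Annotation],
) -> List[Annotation]:
    """Assumed filter rule, for illustration only.

    Keeps an annotation when (a) its label is defined in the schema and
    (b) its span text appears verbatim in the source document.
    """
    known_labels = {guideline.name for guideline in schema}
    return [
        ann
        for ann in annotations
        if ann.label in known_labels and ann.text in document
    ]


schema = [LabelGuideline("Drug", "A named pharmaceutical compound.", ["aspirin"])]
document = "The patient was given aspirin after surgery."
candidates = [
    Annotation("aspirin", "Drug"),         # valid: label known, span present
    Annotation("ibuprofen", "Drug"),       # dropped: span not in the document
    Annotation("surgery", "Procedure"),    # dropped: label not in the schema
]
kept = filter_consistent(document, schema, candidates)
```

In this toy run only the `aspirin` annotation survives. A real filter would likely be stricter, for example by checking character offsets or agreement across several generations.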

📝 Abstract

Information Extraction (IE) systems are traditionally domain-specific, requiring costly adaptation that involves expert schema design, data annotation, and model training. While Large Language Models have shown promise in zero-shot IE, performance degrades significantly in unseen domains where label definitions differ. This paper introduces GUIDEX, a novel method that automatically defines domain-specific schemas, infers guidelines, and generates synthetically labeled instances, allowing for better out-of-domain generalization. Fine-tuning Llama 3.1 with GUIDEX sets a new state-of-the-art across seven zero-shot Named Entity Recognition benchmarks. Models trained with GUIDEX gain up to 7 F1 points over previous methods without human-labeled data, and nearly 2 F1 points when combined with it. Models trained on GUIDEX demonstrate enhanced comprehension of complex, domain-specific annotation schemas. Code, models, and synthetic datasets are available at neilus03.github.io/guidex.com
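For readers unfamiliar with the metric, the sketch below shows one common way to compute the F1 score for named entity recognition: micro-averaged over exact (span text, label) matches. The exact evaluation protocol used in the paper is not stated in this summary, so treat the matching rule and the function name `span_f1` as assumptions. An "F1 point" is one hundredth of the 0-to-1 score, so a gain of 7 points corresponds to, say, 0.60 rising to 0.67.

```python
from typing import Iterable, Set, Tuple

Span = Tuple[str, str]  # (span text, label); a simplification of offset-based spans


def span_f1(gold: Iterable[Set[Span]], pred: Iterable[Set[Span]]) -> float:
    """Micro-averaged F1 over exact (text, label) matches, one set per document."""
    true_pos = false_pos = false_neg = 0
    for gold_spans, pred_spans in zip(gold, pred):
        true_pos += len(gold_spans & pred_spans)   # predicted and correct
        false_pos += len(pred_spans - gold_spans)  # predicted but wrong
        false_neg += len(gold_spans - pred_spans)  # missed gold spans
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = [{("Paris", "LOC"), ("Marie Curie", "PER")}]
pred = [{("Paris", "LOC"), ("Curie", "PER")}]
score = span_f1(gold, pred)  # one hit, one spurious span, one miss
```

Here precision and recall are both 1/2, so the score is 0.5, or 50 F1 points. The partial match on "Curie" earns no credit under exact matching.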

Problem

Research questions and friction points this paper is trying to address.

Improving zero-shot IE performance in unseen domains

Automating domain-specific schema and labeled data generation

Enhancing model comprehension of complex annotation schemas

Innovation

Methods, ideas, or system contributions that make the work stand out.

Automatically defines domain-specific schemas

Generates synthetically labeled instances

Fine-tunes Llama 3.1 for zero-shot IE
