GuideX: Guided Synthetic Data Generation for Zero-Shot Information Extraction

📅 2025-05-31
🤖 AI Summary
Zero-shot information extraction degrades sharply in unseen domains, largely because annotation guidelines differ from those seen during training. To address this, the paper proposes the first end-to-end guided synthetic data generation framework. It requires no human annotation: it automatically discovers domain-specific patterns, infers structured annotation guidelines, and generates high-quality, controllable synthetic data. The method uses a fine-tuned Llama 3.1 and integrates prompt-driven pattern discovery, guideline inference, structured instruction distillation, and consistency-based filtering to improve large language models' understanding of, and generalization across, heterogeneous schemas. Evaluated on seven zero-shot named entity recognition benchmarks, it establishes new state-of-the-art results, gaining up to +7.0 F1 over the prior best methods in the pure zero-shot setting; when combined with a small amount of real data, it improves by a further +1.9 F1.

📝 Abstract
Information Extraction (IE) systems are traditionally domain-specific, requiring costly adaptation that involves expert schema design, data annotation, and model training. While Large Language Models have shown promise in zero-shot IE, performance degrades significantly in unseen domains where label definitions differ. This paper introduces GUIDEX, a novel method that automatically defines domain-specific schemas, infers guidelines, and generates synthetically labeled instances, allowing for better out-of-domain generalization. Fine-tuning Llama 3.1 with GUIDEX sets a new state-of-the-art across seven zero-shot Named Entity Recognition benchmarks. Models trained with GUIDEX gain up to 7 F1 points over previous methods without human-labeled data, and nearly 2 F1 points when combined with it. Models trained on GUIDEX demonstrate enhanced comprehension of complex, domain-specific annotation schemas. Code, models, and synthetic datasets are available at neilus03.github.io/guidex.com
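The gains above are reported in F1 points. As a reminder of what that number measures, here is a minimal sketch of exact-match, span-level F1 for NER. The abstract does not say which scoring variant the benchmarks use, so the exact-match rule, the `Span` tuple layout, and the `span_f1` name are assumptions made for illustration only.

```python
from typing import Set, Tuple

# Hypothetical span representation: (start offset, end offset, label).
Span = Tuple[int, int, str]


def span_f1(gold: Set[Span], pred: Set[Span]) -> float:
    """Exact-match F1: a prediction counts only if offsets and label both match."""
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


gold_spans = {(0, 2, "PER"), (5, 7, "ORG")}
pred_spans = {(0, 2, "PER"), (5, 7, "LOC")}  # second span has the wrong label
score = span_f1(gold_spans, pred_spans)
```

An "F1 point" is one hundredth of this score, so a gain of 7 F1 points corresponds to a difference of 0.07 on the 0 to 1 scale.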
Problem

Research questions and friction points this paper is trying to address.

Improving zero-shot IE performance in unseen domains
Automating domain-specific schema and labeled data generation
Enhancing model comprehension of complex annotation schemas
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automatically defines domain-specific schemas
Generates synthetically labeled instances
Fine-tunes Llama 3.1 for zero-shot IE
Neil De La Fuente
Student Researcher, Technical University of Munich
Deep Learning, Synthetic Data, Computer Vision, Self Supervised Learning, NLP
Oscar Sainz
University of the Basque Country (UPV/EHU)
Computer Science, Artificial Intelligence, Natural Language Processing, Information Extraction
Iker García-Ferrero
HiTZ Basque Center for Language Technology - Ixa NLP Group, University of the Basque Country (UPV/EHU)
Eneko Agirre
HiTZ Basque Center for Language Technology - Ixa NLP Group, University of the Basque Country (UPV/EHU)