TARGA: Targeted Synthetic Data Generation for Practical Reasoning over Structured Data

📅 2024-12-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Weak generalization and heavy reliance on manually annotated data hinder semantic parsing. To address this, we propose a goal-oriented synthetic data generation framework that dynamically co-generates highly relevant logical queries and their corresponding natural language questions, without human annotation, via entity-relation-driven hierarchical graph expansion and cross-layer logical composition. Leveraging large language models (LLMs), the method enables controllable query-to-question generation and enhances in-context learning, significantly improving generalization under out-of-distribution settings and boosting sample efficiency. Evaluated on GrailQA and KBQA-Agent, the approach achieves F1 gains of +7.7 and +12.2, respectively, using only an open-weight 7B-parameter LLM, outperforming non-fine-tuned methods built on closed-source LLMs. It also demonstrates superior robustness and scalability.
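The expansion-and-combination step lends itself to a short illustration. Below is a minimal sketch of layer-wise query enumeration over a toy in-memory graph; the `KB` dict, the triple-path representation, and the pairwise combination are illustrative stand-ins, not the paper's actual Freebase logical forms or composition operators.

```python
from itertools import combinations

# Toy KB: subject -> [(relation, object), ...]  (hypothetical data)
KB = {
    "TaylorSwift": [("recorded", "Album_1989"), ("bornIn", "Reading_PA")],
    "Album_1989": [("releasedBy", "BigMachine")],
}

def expand_layer(frontier):
    """One hop of expansion: extend each partial query by every
    outgoing relation of its current tail entity."""
    next_frontier = []
    for path, tail in frontier:
        for rel, obj in KB.get(tail, []):
            next_frontier.append((path + [(tail, rel, obj)], obj))
    return next_frontier

def enumerate_queries(seed_entities, max_hops=2):
    """Layer-wise expansion from the question's linked entities,
    collecting candidate queries at every depth."""
    frontier = [([], e) for e in seed_entities]
    layers = []
    for _ in range(max_hops):
        frontier = expand_layer(frontier)
        layers.append([path for path, _ in frontier])
    # Cross-layer combination: pair queries from different layers,
    # a simple stand-in for the paper's logical composition.
    combined = [a + b for la, lb in combinations(layers, 2)
                for a in la for b in lb]
    return [q for layer in layers for q in layer] + combined

for q in enumerate_queries(["TaylorSwift"]):
    print(q)
```

Seeding the expansion with the question's own entities and relations keeps the candidate set small and targeted, which is what makes the resulting synthetic demonstrations highly relevant to the input question.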

📝 Abstract
Semantic parsing, which converts natural language questions into logic forms, plays a crucial role in reasoning within structured environments. However, existing methods encounter two significant challenges: reliance on extensive manually annotated datasets and limited generalization capability to unseen examples. To tackle these issues, we propose Targeted Synthetic Data Generation (TARGA), a practical framework that dynamically generates high-relevance synthetic data without manual annotation. Starting from the pertinent entities and relations of a given question, we probe for potentially relevant queries through layer-wise expansion and cross-layer combination. We then generate corresponding natural language questions for these constructed queries, which jointly serve as synthetic demonstrations for in-context learning. Experiments on multiple knowledge base question answering (KBQA) datasets demonstrate that TARGA, using only a 7B-parameter model, substantially outperforms existing non-fine-tuned methods that rely on closed-source models, achieving notable F1 improvements on GrailQA (+7.7) and KBQA-Agent (+12.2). Furthermore, TARGA exhibits superior sample efficiency, robustness, and generalization under non-I.I.D. settings.
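The query-to-question step and the demonstration assembly can be sketched as well. In the sketch below, `call_llm` is a hypothetical placeholder for any chat-completion client, and the prompt wording and s-expression query are invented for illustration; the paper's actual prompts and controllable-generation setup are not reproduced here.

```python
def call_llm(prompt: str) -> str:
    # Hypothetical stand-in: swap in a real chat-completion client.
    return "Which label released the album 1989?"

def query_to_question(logical_query: str) -> str:
    """Verbalize a synthetic logical query into a natural language
    question (the controllable query-to-question step)."""
    prompt = (
        "Rewrite this logical query as a fluent question.\n"
        f"Query: {logical_query}\nQuestion:"
    )
    return call_llm(prompt).strip()

def build_icl_prompt(synthetic_queries, target_question):
    """Pair each synthetic query with its generated question and use
    the (question, query) pairs as in-context demonstrations."""
    demos = [
        f"Question: {query_to_question(q)}\nLogical form: {q}"
        for q in synthetic_queries
    ]
    demos.append(f"Question: {target_question}\nLogical form:")
    return "\n\n".join(demos)

print(build_icl_prompt(
    ["(JOIN releasedBy Album_1989)"],   # illustrative s-expression
    "Who released Taylor Swift's 1989?",
))
```

Because every demonstration is built on demand for the question at hand, the in-context examples stay close to the target distribution even when the question itself is out of distribution.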
Problem

Research questions and friction points this paper is trying to address.

Semantic Parsing
Generalization Ability
Annotated Data Dependency
Innovation

Methods, ideas, or system contributions that make the work stand out.

TARGA
Automatic Synthetic Data Generation
Unseen Problem Handling
Xiang Huang
Nanjing University, Tongyi Lab
KBQA, Instruction following, Alignment, RL
Jiayu Shen
State Key Laboratory for Novel Software Technology, Nanjing University, China
Shanshan Huang
State Key Laboratory for Novel Software Technology, Nanjing University, China
Sitao Cheng
University of Waterloo
NLP, Language Agents, Reasoning
Xiaxia Wang
University of Oxford
Neuro-Symbolic Reasoning, Semantic Search, Knowledge Graph
Yuzhong Qu
State Key Laboratory for Novel Software Technology, Nanjing University, China