TARGA: Targeted Synthetic Data Generation for Practical Reasoning over Structured Data

📅 2024-12-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Weak generalization and heavy reliance on manually annotated data hinder semantic parsing. To address this, we propose a goal-oriented synthetic data generation framework that dynamically co-generates highly relevant logical queries and their corresponding natural language questions, without human annotation, via entity-relation-driven hierarchical graph expansion and cross-layer logical composition. Leveraging large language models (LLMs), the method enables controllable query-to-question generation and enhances in-context learning, significantly improving generalization under out-of-distribution settings and boosting sample efficiency. Evaluated on GrailQA and KBQA-Agent, the approach achieves F1 gains of +7.7 and +12.2, respectively, using only an open-weight 7B-parameter LLM, outperforming non-fine-tuned methods built on closed-source LLMs. It also demonstrates superior robustness and scalability.
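The expansion-and-combination step lends itself to a short illustration. Below is a minimal sketch of layer-wise query enumeration over a toy in-memory graph; the `KB` dict, the triple-path representation, and the pairwise combination are illustrative stand-ins, not the paper's actual Freebase logical forms or composition operators.

```python
from itertools import combinations

# Toy KB: subject -> [(relation, object), ...]  (hypothetical data)
KB = {
    "TaylorSwift": [("recorded", "Album_1989"), ("bornIn", "Reading_PA")],
    "Album_1989": [("releasedBy", "BigMachine")],
}

def expand_layer(frontier):
    """One hop of expansion: extend each partial query by every
    outgoing relation of its current tail entity."""
    next_frontier = []
    for path, tail in frontier:
        for rel, obj in KB.get(tail, []):
            next_frontier.append((path + [(tail, rel, obj)], obj))
    return next_frontier

def enumerate_queries(seed_entities, max_hops=2):
    """Layer-wise expansion from the question's linked entities,
    collecting candidate queries at every depth."""
    frontier = [([], e) for e in seed_entities]
    layers = []
    for _ in range(max_hops):
        frontier = expand_layer(frontier)
        layers.append([path for path, _ in frontier])
    # Cross-layer combination: pair queries from different layers,
    # a simple stand-in for the paper's logical composition.
    combined = [a + b for la, lb in combinations(layers, 2)
                for a in la for b in lb]
    return [q for layer in layers for q in layer] + combined

for q in enumerate_queries(["TaylorSwift"]):
    print(q)
```

Seeding the expansion with the question's own entities and relations keeps the candidate set small and targeted, which is what makes the resulting synthetic demonstrations highly relevant to the input question.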

📝 Abstract
Semantic parsing, which converts natural language questions into logic forms, plays a crucial role in reasoning within structured environments. However, existing methods encounter two significant challenges: reliance on extensive manually annotated datasets and limited generalization capability to unseen examples. To tackle these issues, we propose Targeted Synthetic Data Generation (TARGA), a practical framework that dynamically generates high-relevance synthetic data without manual annotation. Starting from the pertinent entities and relations of a given question, we probe for potentially relevant queries through layer-wise expansion and cross-layer combination. We then generate corresponding natural language questions for these constructed queries, which jointly serve as synthetic demonstrations for in-context learning. Experiments on multiple knowledge base question answering (KBQA) datasets demonstrate that TARGA, using only a 7B-parameter model, substantially outperforms existing non-fine-tuned methods that rely on closed-source models, achieving notable F1 improvements on GrailQA (+7.7) and KBQA-Agent (+12.2). Furthermore, TARGA exhibits superior sample efficiency, robustness, and generalization under non-I.I.D. settings.
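The query-to-question step and the demonstration assembly can be sketched as well. In the sketch below, `call_llm` is a hypothetical placeholder for any chat-completion client, and the prompt wording and s-expression query are invented for illustration; the paper's actual prompts and controllable-generation setup are not reproduced here.

```python
def call_llm(prompt: str) -> str:
    # Hypothetical stand-in: swap in a real chat-completion client.
    return "Which label released the album 1989?"

def query_to_question(logical_query: str) -> str:
    """Verbalize a synthetic logical query into a natural language
    question (the controllable query-to-question step)."""
    prompt = (
        "Rewrite this logical query as a fluent question.\n"
        f"Query: {logical_query}\nQuestion:"
    )
    return call_llm(prompt).strip()

def build_icl_prompt(synthetic_queries, target_question):
    """Pair each synthetic query with its generated question and use
    the (question, query) pairs as in-context demonstrations."""
    demos = [
        f"Question: {query_to_question(q)}\nLogical form: {q}"
        for q in synthetic_queries
    ]
    demos.append(f"Question: {target_question}\nLogical form:")
    return "\n\n".join(demos)

print(build_icl_prompt(
    ["(JOIN releasedBy Album_1989)"],   # illustrative s-expression
    "Who released Taylor Swift's 1989?",
))
```

Because every demonstration is built on demand for the question at hand, the in-context examples stay close to the target distribution even when the question itself is out of distribution.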
Problem

Research questions and friction points this paper is trying to address.

Semantic Parsing
Generalization Ability
Annotated Data Dependency
Innovation

Methods, ideas, or system contributions that make the work stand out.

TARGA
Automatic Synthetic Data Generation
Unseen Problem Handling
Xiang Huang
Nanjing University, Tongyi Lab
KBQA, Instruction following, Alignment, RL
Jiayu Shen
State Key Laboratory for Novel Software Technology, Nanjing University, China
Shanshan Huang
State Key Laboratory for Novel Software Technology, Nanjing University, China
Sitao Cheng
University of Waterloo
NLP, Language Agents, Reasoning
Xiaxia Wang
University of Oxford
Neuro-Symbolic Reasoning, Semantic Search, Knowledge Graph
Yuzhong Qu
State Key Laboratory for Novel Software Technology, Nanjing University, China