AmalREC: A Dataset for Relation Extraction and Classification Leveraging Amalgamation of Large Language Models

📅 2024-12-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing relation classification (RC) and extraction (RE) datasets suffer from limited relation type diversity and narrow domain coverage. To address this, we propose a five-stage LLM-driven pipeline integrating template-guided generation, a novel multidimensional Sentence Evaluation Index (SEI), and an SEI-Ranker filtering mechanism to achieve high-fidelity, high-diversity conversion of relation tuples into natural-language sentences. We introduce the first SEI-based quantification framework and an LLM-fused distillation strategy, enabling strategic ensemble of multi-model generations. We construct the largest general-purpose RE/RC benchmark to date—comprising 255 fine-grained relation types, 150K training sentences, and 15K test samples. Our method achieves significant improvements over state-of-the-art on standard RE/RC benchmarks, empirically demonstrating LLMs’ differential capabilities in relational semantic concretization, and providing a reproducible, methodology-driven data construction framework.

📝 Abstract
Existing datasets for relation classification and extraction often exhibit limitations such as restricted relation types and domain-specific biases. This work presents a generic framework to generate well-structured sentences from given tuples with the help of Large Language Models (LLMs). This study focuses on the following major questions: (i) how to generate sentences from relation tuples, (ii) how to compare and rank them, (iii) can we combine the strengths of individual methods and amalgamate them to generate sentences of even better quality, and (iv) how to evaluate the final dataset? For the first question, we employ a multifaceted 5-stage pipeline approach, leveraging LLMs in conjunction with template-guided generation. We introduce the Sentence Evaluation Index (SEI), which prioritizes factors like grammatical correctness, fluency, human-aligned sentiment, accuracy, and complexity, to answer the first part of the second question. To answer the second part of the second question, this work introduces an SEI-Ranker module that leverages SEI to select top candidate generations. The top sentences are then strategically amalgamated to produce the final, high-quality sentence. Finally, we evaluate our dataset on LLM-based and SOTA baselines for relation classification. The proposed dataset features 255 relation types, with 15K sentences in the test set and around 150K in the train set, significantly enhancing relational diversity and complexity. This work not only presents a new comprehensive benchmark dataset for the RE/RC task, but also compares different LLMs on generating quality sentences from relational tuples.
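The SEI-Ranker described above can be pictured as a weighted aggregation over the five quality dimensions, followed by a top-k selection. The sketch below is a minimal illustration of that idea; the dataclass, the equal weights, and the function names are assumptions for exposition, not the paper's actual implementation.

```python
# Hypothetical sketch of an SEI-style ranker. The weights, sub-scores, and
# names below are illustrative assumptions, not the paper's implementation.
from dataclasses import dataclass


@dataclass
class SEIScores:
    grammar: float      # grammatical correctness in [0, 1]
    fluency: float      # fluency in [0, 1]
    sentiment: float    # human-aligned sentiment in [0, 1]
    accuracy: float     # faithfulness to the source relation tuple
    complexity: float   # sentence complexity


# Assumed equal weighting; the paper may weight dimensions differently.
WEIGHTS = {"grammar": 0.2, "fluency": 0.2, "sentiment": 0.2,
           "accuracy": 0.2, "complexity": 0.2}


def sei(scores: SEIScores) -> float:
    """Aggregate the five sub-scores into a single SEI value."""
    return sum(w * getattr(scores, name) for name, w in WEIGHTS.items())


def sei_rank(candidates: list[tuple[str, SEIScores]],
             top_k: int = 3) -> list[str]:
    """Return the top-k candidate sentences by SEI, best first.

    `candidates` pairs each generated sentence (e.g. from different LLMs
    or templates) with its per-dimension scores; the top-k survivors
    would then be passed to the amalgamation step.
    """
    ranked = sorted(candidates, key=lambda c: sei(c[1]), reverse=True)
    return [sentence for sentence, _ in ranked[:top_k]]
```

In this reading, each LLM or template contributes one candidate per tuple, the ranker keeps the k best, and a final fusion step amalgamates them into the released sentence.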
Problem

Research questions and friction points this paper is trying to address.

Relation Classification
Data Diversity
Large Language Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language Models
Relation Diversity
Sentence Evaluation Index
Mansi
Department of Computer Science and Engineering, Indian Institute of Technology, Guwahati
Pranshu Pandya
Department of Computer Science and Engineering, Indian Institute of Technology, Guwahati
M. Vora
Department of Computer Science and Engineering, Indian Institute of Technology, Guwahati
Soumya Bharadwaj
Department of Computer Science and Engineering, Indian Institute of Technology, Guwahati
Ashish Anand
Professor of Computer Science, IIT Guwahati, India
NLP, Clinical Data Mining, Computational Biology, Machine Learning