🤖 AI Summary
Existing relation classification (RC) and extraction (RE) datasets suffer from limited relation type diversity and narrow domain coverage. To address this, we propose a five-stage LLM-driven pipeline integrating template-guided generation, a novel multidimensional Sentence Evaluation Index (SEI), and an SEI-Ranker filtering mechanism to achieve high-fidelity, high-diversity conversion of relation tuples into natural-language sentences. We introduce the first SEI-based quantification framework and an LLM-fused distillation strategy, enabling a strategic ensemble of multi-model generations. We construct the largest general-purpose RE/RC benchmark to date, comprising 255 fine-grained relation types, 150K training sentences, and 15K test samples. Our method achieves significant improvements over the state of the art on standard RE/RC benchmarks, empirically demonstrating LLMs’ differential capabilities in relational semantic concretization, and providing a reproducible, methodology-driven data construction framework.
📝 Abstract
Existing datasets for relation classification and extraction often exhibit limitations such as restricted relation types and domain-specific biases. This work presents a generic framework to generate well-structured sentences from given tuples with the help of Large Language Models (LLMs). This study focuses on the following major questions: (i) how to generate sentences from relation tuples, (ii) how to compare and rank them, (iii) can we combine the strengths of individual methods and amalgamate them to generate even better-quality sentences, and (iv) how to evaluate the final dataset? For the first question, we employ a multifaceted 5-stage pipeline approach, leveraging LLMs in conjunction with template-guided generation. We introduce the Sentence Evaluation Index (SEI), which prioritizes factors like grammatical correctness, fluency, human-aligned sentiment, accuracy, and complexity, to answer the first part of the second question. To answer the second part of the second question, this work introduces an SEI-Ranker module that leverages SEI to select top candidate generations. The top sentences are then strategically amalgamated to produce the final, high-quality sentence. Finally, we evaluate our dataset on LLM-based and SOTA baselines for relation classification. The proposed dataset features 255 relation types, with 15K sentences in the test set and around 150K in the training set, significantly enhancing relational diversity and complexity. This work not only presents a new comprehensive benchmark dataset for the RE/RC task, but also compares different LLMs on the generation of quality sentences from relational tuples.